Abstract

We study the estimation of $\beta$ for the nonlinear model $y = f(X^\top\beta) + \epsilon$
when $f$ is a nonlinear transformation that is known, $\beta$ has sparse nonzero
coordinates, and the number of observations can be much smaller than that of
parameters ($n \ll p$). We show that in order to bound the $L_2$ error of the $L_0$
regularized estimator $\hat\beta$, i.e., $\|\hat\beta - \beta\|_2$, it is sufficient to establish two
conditions. Based on this, we obtain bounds of the $L_2$ error for (1) $L_0$ regularized
maximum likelihood estimation (MLE) for exponential linear models and (2)
$L_0$ regularized least square (LS) regression for the more general case where $f$ is
analytic. For the analytic case, we rely on power series expansion of $f$, which
requires taking into account the singularities of $f$.

1 Introduction

Regularized estimation for sparse models that have a large number of parameters
compared to that of observations has become an important topic in statistics,
machine learning, and a few other areas (Bunea et al. 2007, Candes & Tao
2007, Donoho et al. 2006, Efron et al. 2004, Field 1994, Natarajan 1995, Zhao & Yu
2006). The research in these areas has focused on regularized least square (LS)
regression for sparse linear models $y = X\beta + \epsilon$, where $y \in \mathbb{R}^n$ is the response vector,
$X \in \mathbb{R}^{n \times p}$ the design matrix, $\beta \in \mathbb{R}^p$ the vector of parameters, and $\epsilon \in \mathbb{R}^n$ the
random error vector that has mean 0 given $X$. By sparse we mean that the number of
nonzero coordinates of $\beta$ is much smaller than $p$ (Wasserman & Roeder 2009).

On the other hand, nonlinear models such as logistic models that have underlying
linear structures are widely used. The general form of such models is

$$y = f(X^\top\beta) + \epsilon, \tag{1.1}$$

where $f : \mathbb{R} \to \mathbb{R}$ is a nonlinear function that may or may not be known. Here and
henceforth, for $x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$, we denote

$$f(x) = (f(x_1), \ldots, f(x_n))^\top.$$

The need for nonlinear models with sparse underlying linear structure is clearly
laid out in several recent works in neuroscience (Sharpee et al. 2008, 2004), and some
algorithms based on information criteria have been proposed to estimate not only
$\beta$ but also $f$. However, at this point, it seems very hard to evaluate the estimation
precision of those algorithms.

In this article we are content to establish the $L_2$ precision of the $L_0$ regularized
estimator of $\beta$ for sparse models, when the design matrix $X$ is fixed and $f$ is known.
We shall allow $n \ll p$. Despite its limitation from a computational point of view,
$L_0$ regularization is an important and conceptually simple instrument for parameter
estimation and model selection (Akaike 1974, Huang et al. 2008, Schwarz 1978).
Besides, since many improvements over $L_0$ regularization are achieved by taking
advantage of properties of linear models that may fail to hold for nonlinear models
(Zhao & Yu 2006), it is reasonable to take $L_0$ regularization as a prototype for
further study on nonlinear models. With this in mind, our concern is whether good
estimation precision can be achieved rather than how fast it can be achieved.

In Section 2, we establish a basic result. We show that provided two conditions
are satisfied, the $L_2$ error of the $L_0$ regularized estimator satisfies a quadratic
inequality which yields the estimation precision. Consequently, establishing the
estimation precision is reduced to establishing the two conditions. As a minor benefit
of the result, independence of the coordinates of $\epsilon$ in general need not be assumed.

We will also set up notation and collect other preliminary results in Section 2.
After that, we shall establish the two conditions for exponential linear models
and for analytic models, i.e., models with analytic $f$. Although a special case of
analytic models, exponential linear models are much simpler to handle due to the
explicit expression of the conditional density of $y$ given $X$. For these models, we
consider the maximum likelihood estimator (MLE). The discussion is in Section 3.
For analytic models, we will consider the LS regression. Sections 4 and 5 establish
the two conditions, respectively. In Section 5, the approach is to use infinite
power series expansion of $f$. The main complexity of the approach arises when $f$
has singularities on $\mathbb{C}$. To illustrate, we will use as working examples the logistic
regression model in Section 3 and a noise corrupted version of it in Section 5. Most
of the proofs are collected in Section 6.

2 Preliminaries 
2.1 Notation 

Denote by $X_1^\top, \ldots, X_n^\top$ the row vectors of $X$, with $X_i \in \mathbb{R}^p$. Denote by $V_1, \ldots, V_p$
the column vectors of $X$. We shall always assume that $X$ is fixed and impose the
condition that $V_j \ne 0$. In fact, if a column vector of $X$ is 0, then it has no effect
on $y$ and should be removed. In the subsequent discussion, the column vectors of
$X$ should be understood as unnormalized. It is therefore helpful to think of $X$ as a
collection of covariate vectors registered exactly as they are observed.

For $S = \{i_1, \ldots, i_k\}$, with $1 \le i_1 < \ldots < i_k \le p$, denote $X_S = (V_{i_1}, \ldots, V_{i_k})$,
and for $u \in \mathbb{R}^p$, denote $u_S = (u_{i_1}, \ldots, u_{i_k})^\top$. The support of $u$ is

$$\mathrm{spt}(u) = \{i : u_i \ne 0\}.$$

Denote by $\|u\|_p$ the $L_p$ norm of $u$. If $A$ is a set, denote by $|A|$ its cardinality. The $L_0$
norm of $u$ refers to $|\mathrm{spt}(u)|$ and is often denoted by $\|u\|_0$. We choose the notation
$|\mathrm{spt}(u)|$ since it seems more intuitive.

For $\varphi = (\varphi_1, \ldots, \varphi_n)$ and $x \in \mathbb{R}^n$, where each $\varphi_i : \mathbb{R} \to \mathbb{R}$, denote

$$\varphi(x) = (\varphi_1(x_1), \ldots, \varphi_n(x_n))^\top.$$

2.2 General form of estimator and line of argument 

The general form of an $L_0$ regularized estimator is

$$\hat\beta = \arg\min_{u \in D}\,[\ell(y, Xu) + c_r|\mathrm{spt}(u)|], \tag{2.1}$$

where $D$ is a pre-selected search domain in $\mathbb{R}^p$, $\ell(y, Xu)$ is a certain loss function, and
$c_r > 0$ is a tuning parameter. For the MLE, $\ell(y, Xu)$ is the negative log likelihood,
while for the LS regression, it is $\|y - Xu\|_2^2$. For linear regression, $D$ is typically set
equal to $\mathbb{R}^p$. However, for nonlinear regression, our position is that some constraint
on $D$ is needed in order to control the potentially large variation of the functional
property of $f$ at different possible values of $X\beta$.

For both the MLE and LS regression, the argument to establish the precision of
$\hat\beta$ proceeds as follows. First, it is easy to show that $\hat\beta$ satisfies an inequality of the
following form,

$$G(\psi(X\hat\beta) - \psi(X\beta)) \le 2|\langle\epsilon, \varphi(X\hat\beta) - \varphi(X\beta)\rangle| - c_r(|\mathrm{spt}(\hat\beta)| - |\mathrm{spt}(\beta)|), \tag{2.2}$$

where $G$ is a function $\mathbb{R}^n \to \mathbb{R}$, $\psi = (\psi_1, \ldots, \psi_n)$ and $\varphi = (\varphi_1, \ldots, \varphi_n)$, with $\psi_i$ and
$\varphi_i$ being functions $\mathbb{R} \to \mathbb{R}$. Then the following two conditions will be established.

Condition H1 Given $q \in (0, 1)$, there is $c_1 = c_1(X, \beta, \varphi, q) > 0$, such that

$$\Pr\{|\langle\epsilon, \varphi(Xu) - \varphi(X\beta)\rangle| \le c_1\sqrt{n}\,\|u - \beta\|_1, \text{ all } u \in D\} \ge 1 - 2q.$$

The coefficient 2 in $1 - 2q$ is nonessential. It is for ease of notation in the statements
of main results.

Condition H2 There is $c_2 = c_2(X, \beta, \psi) > 0$, such that for all $u \in D$,

$$G(\psi(Xu) - \psi(X\beta)) \ge c_2 n\|u - \beta\|_2^2.$$

The constants $c_1$ and $c_2$ will be explicitly constructed. In general, both depend
on $X$. Since we only consider fixed design, they are nonrandom.

We will check the conditions respectively for the MLE and LS regression. Once
this is done, using the next result, we then obtain a bound on $\|\hat\beta - \beta\|_2$. Note that
the result is stated in a slightly more general form as it does not require that $\hat\beta$ be the
one defined by (2.1).

Proposition 2.1 Suppose Conditions H1 and H2 are satisfied. If $\hat\beta \in D$ is a
random variable that always satisfies the inequality (2.2) with $c_r = 3c_1^2/c_2$, then,
letting $K_r = 3c_1/c_2$,

$$\Pr\left\{\|\hat\beta - \beta\|_2 \le \frac{K_r\sqrt{|\mathrm{spt}(\beta)|}}{\sqrt{n}}\right\} \ge 1 - 2q.$$

In order for the bounds to be meaningful, we need to make sure $K_r$ is not too
large, at least compared to $\sqrt{n}$. This will be the main consideration when we try
to establish Conditions H1 and H2.

Because Proposition 2.1 plays a fundamental role in our study, we give its proof 
below. This is the only result whose proof appears in the main text. 

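Proof of Proposition 2.1. Denote $T = \mathrm{spt}(\beta)$ and $S = \mathrm{spt}(\hat\beta)$. Under Conditions
H1 and H2, with probability at least $1 - 2q$,

$$c_2 n\|\hat\beta - \beta\|_2^2 \le 2c_1\sqrt{n}\,\|\hat\beta - \beta\|_1 - c_r(|S| - |T|)
\le 2c_1\sqrt{n|S \cup T|}\,\|\hat\beta - \beta\|_2 - c_r(|S| - |T|),$$

where the second inequality is due to $\mathrm{spt}(\hat\beta - \beta) \subset S \cup T$ and the Cauchy-Schwarz
inequality. Let $t = \|\hat\beta - \beta\|_2$ and $b = c_1/c_2$. Dividing by $c_2 n$ and using $c_r/c_2 = 3b^2$,

$$t^2 - \frac{2b\sqrt{|S \cup T|}}{\sqrt{n}}\,t + \frac{3b^2(|S| - |T|)}{n} \le 0.$$

The left hand side is a quadratic function in $t$. In order for the inequality to hold,
there have to be $|S \cup T| \ge 3(|S| - |T|)$ and

$$\frac{b}{\sqrt{n}}\left(\sqrt{|S \cup T|} - \sqrt{|S \cup T| + 3(|T| - |S|)}\right) \le t \le \frac{b}{\sqrt{n}}\left(\sqrt{|S \cup T|} + \sqrt{|S \cup T| + 3(|T| - |S|)}\right).$$

Let $T_1 = T \setminus S$ and $S_1 = S \setminus T$. By $|S \cup T| = |S_1| + |T|$ and $|T| - |S| = |T_1| - |S_1|$,

$$0 \le t \le \frac{b}{\sqrt{n}}\left(\sqrt{|T| + |S_1|} + \sqrt{|T| + 3|T_1| - 2|S_1|}\right).$$

It is easy to see that due to $|T_1| \le |T|$, the right hand side is a decreasing function
in $|S_1|$ on $[0, (|T| + 3|T_1|)/2]$, and hence is no greater than its value at 0, which is
$(b/\sqrt{n})(\sqrt{|T|} + \sqrt{|T| + 3|T_1|}) \le 3b\sqrt{|T|}/\sqrt{n}$. □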
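
The last step of the proof can be checked numerically. The sketch below is only an illustration: the function names and the grid of set sizes are assumptions made here, not part of the paper. It evaluates the upper root of the quadratic for every feasible pair of sizes $(|T_1|, |S_1|)$ and confirms that it does not exceed $3b\sqrt{|T|}/\sqrt{n}$.

```python
import math

def upper_root(b, n, T, T1, S1):
    """Upper root of the quadratic in t, written with the set sizes
    T = |T|, T1 = |T minus S| and S1 = |S minus T|."""
    disc = T + 3 * T1 - 2 * S1          # |S u T| + 3(|T| - |S|)
    if disc < 0:
        return None                      # the quadratic inequality has no solution
    return b / math.sqrt(n) * (math.sqrt(T + S1) + math.sqrt(disc))

def bound_holds(b, n, max_T=8):
    """Check upper_root <= 3 b sqrt(|T|) / sqrt(n) on a small grid of set sizes."""
    for T in range(1, max_T + 1):
        limit = 3 * b * math.sqrt(T) / math.sqrt(n)
        for T1 in range(0, T + 1):
            for S1 in range(0, (T + 3 * T1) // 2 + 1):
                root = upper_root(b, n, T, T1, S1)
                if root is not None and root > limit + 1e-12:
                    return False
    return True

print(bound_holds(b=0.7, n=50))
```

The bound is attained when $T_1 = T$ and $S_1 = \emptyset$, which is why a small floating point tolerance is used in the comparison.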



To establish Conditions H1 and H2, certain assumptions are needed. We next
discuss the major assumptions used by both the MLE and LS regression.

2.3 Tail assumption on errors

To establish Condition H1, we will need the following assumption on $\epsilon$.

Tail assumption. There is $\sigma > 0$, such that for any $t, a_1, \ldots, a_n \in \mathbb{R}$,

$$\Pr\left\{\Big(\sum_{i=1}^n a_i\epsilon_i\Big)^2 > t^2\sum_{i=1}^n a_i^2\right\} \le 2\exp\left\{-\frac{t^2}{2\sigma^2}\right\}. \tag{2.3}$$

The tail assumption (2.3) is rather mild. If $\epsilon \sim N(0, \sigma^2\Sigma)$ and the spectral radius
of $\Sigma$ is no greater than 1, then (2.3) holds. In this case, $\epsilon_1, \ldots, \epsilon_n$ need not be
independent. Moreover, if $\epsilon_i$ are independent, such that $\mathrm{E}(\epsilon_i) = 0$ and $|\epsilon_i| \le \sigma$ for
all $i$, then by Hoeffding's inequality (Pollard 1984), (2.3) holds.

2.4 Coherence and restricted domains 

In order to identify $\beta$, some conditions on the correlations between the column
vectors of $X$ are needed. The maximum correlation between columns of $X$ is

$$\mu(X) = \sup_{1 \le i < j \le p} \frac{|V_i^\top V_j|}{\|V_i\|_2\|V_j\|_2}.$$

Conditions on $\mu(X)$ are often referred to as the coherence property (Bunea et al. 2007,
Candes & Plan 2009). The following function

$$n(\nu) = (1 - \nu)[1 + 1/\mu(X)] \tag{2.4}$$

will be regularly used in our discussion.
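
As a concrete illustration, the coherence and the function in (2.4) can be computed directly from a design matrix. The following is a minimal sketch in plain Python; the helper names `coherence` and `n_nu` and the small example matrix are assumptions made here for illustration, and returning infinity when the columns are mutually orthogonal is a convention of this sketch.

```python
import math

def coherence(X):
    """Largest absolute cosine between two distinct columns of X (a list of rows)."""
    V = [list(col) for col in zip(*X)]
    norms = [math.sqrt(sum(x * x for x in v)) for v in V]
    mu = 0.0
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            mu = max(mu, abs(dot) / (norms[i] * norms[j]))
    return mu

def n_nu(X, nu):
    """(1 - nu) * [1 + 1/mu(X)] as in (2.4); infinite if the coherence is 0."""
    mu = coherence(X)
    return math.inf if mu == 0 else (1 - nu) * (1 + 1 / mu)

X = [[1, 1, 1],
     [1, -1, 1],
     [1, 1, -1],
     [1, -1, 1]]
print(coherence(X), n_nu(X, 0.5))
```

For this matrix the coherence is 0.5, so with $\nu = 0.5$ the support size allowed by (2.4) is 1.5.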

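Proposition 2.2 Fix $\nu \in [0, 1]$. (1) For $u \in \mathbb{R}^p$, if $|\mathrm{spt}(u)| \le n(\nu)$, then

$$\|Xu\|_2^2 \ge \nu[1 + \mu(X)]\sum_{j=1}^p u_j^2\|V_j\|_2^2.$$

(2) For $u, v \in \mathbb{R}^p$, if $|\mathrm{spt}(u) \cup \mathrm{spt}(v)| \le n(\nu)$, then

$$\|X(u - v)\|_2^2 \ge \nu[1 + \mu(X)]\sum_{j=1}^p (u_j - v_j)^2\|V_j\|_2^2.$$

In particular, the inequality holds if $|\mathrm{spt}(u)| \vee |\mathrm{spt}(v)| \le n(\nu)/2$.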
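
Part (1) of the proposition can be sanity-checked on a small example. The sketch below is an illustration under stated assumptions: the matrix, the value of $\nu$, the random seed and the helper names are all chosen here and are not from the paper. It draws random vectors whose support size is within $n(\nu)$ and records the smallest gap between the two sides of the inequality.

```python
import math
import random

def coherence(X):
    """Largest absolute cosine between two distinct columns of X (a list of rows)."""
    V = [list(c) for c in zip(*X)]
    norms = [math.sqrt(sum(x * x for x in v)) for v in V]
    return max(abs(sum(a * b for a, b in zip(V[i], V[j]))) / (norms[i] * norms[j])
               for i in range(len(V)) for j in range(i + 1, len(V)))

def prop_2_2_sides(X, u, nu):
    """Left and right sides of ||Xu||^2 >= nu [1 + mu(X)] sum_j u_j^2 ||V_j||^2."""
    Xu = [sum(x * uj for x, uj in zip(row, u)) for row in X]
    col_sq = [sum(x * x for x in col) for col in zip(*X)]
    lhs = sum(v * v for v in Xu)
    rhs = nu * (1 + coherence(X)) * sum(uj * uj * c for uj, c in zip(u, col_sq))
    return lhs, rhs

X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, 1]]   # coherence 0.5
nu = 0.25                                              # n(nu) = 0.75 * 3 = 2.25
rng = random.Random(0)
worst_gap = math.inf
for _ in range(1000):
    u = [0.0, 0.0, 0.0]
    for j in rng.sample(range(3), 2):                  # support size 2 <= n(nu)
        u[j] = rng.uniform(-5, 5)
    lhs, rhs = prop_2_2_sides(X, u, nu)
    worst_gap = min(worst_gap, lhs - rhs)
print(worst_gap >= -1e-9)
```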

As mentioned earlier, for the estimator (2.1), we need to impose some constraints
on the search domain $D$. For this purpose, we define several sets. For $I \subset \mathbb{R}$, let

$$\mathcal{D}(I) = \{u \in \mathbb{R}^p : X_i^\top u \in I,\ 1 \le i \le n\}, \tag{2.5}$$

and for $h \ge 1$, let

$$\mathcal{D}(I, h) = \mathcal{D}(I) \cap \{u \in \mathbb{R}^p : |\mathrm{spt}(u)| \le h\}. \tag{2.6}$$

Apparently, denoting by $T$ the mapping $u \to Xu$, $\mathcal{D}(I) = T^{-1}(I^n)$.

One constraint that will be regularly imposed is $D \subset \mathcal{D}(I, n(\nu)/2)$ for some
$\nu \in (0, 1)$. The implied constraint that $X_i^\top u \in I$ for every $i$ is to make sure
that the functions involved in the estimator (2.1), i.e., $G$, $\psi_i$ and $\varphi_i$, have good
enough properties for all candidate values of $\beta$, especially properties determined by
derivatives. This constraint on the functional properties is needed when we establish
both Conditions H1 and H2. For linear regression, roughly speaking, this is not a
concern and one can simply choose $I = \mathbb{R}$, because the derivative of a linear
function is constant, and so the pertinent functional properties are uniform.

The constraint $D \subset \mathcal{D}(I, n(\nu)/2)$ also imposes a constraint on $|\mathrm{spt}(\beta)|$. As
Proposition 2.2 indicates, one consequence of the constraint is that any two candidate
estimates of $\beta$ can be well separated by their corresponding values of $Xu$, so
that a large portion of $\beta$ can be correctly identified. For this reason, the constraint
will be needed when we establish Condition H2. Clearly, the smaller $\mu(X)$ is, the
milder the constraint. Under mild conditions, $\mu(X)$ can be as small as $O(\sqrt{n^{-1}\ln p})$;
see Candes & Plan (2009) and also the comments at the end of Section 3.3. This
results in a constraint of the form $|\mathrm{spt}(\beta)| \le C\sqrt{n/\ln p}$, which is quite mild even
when $p$ is much larger than $n$, for example, $p = n^a$ for some $a > 1$.

We shall need the following properties of $\mathcal{D}(I, h)$.

Proposition 2.3 (1) If $I$ is closed, then $\mathcal{D}(I, 1) \subset \mathcal{D}(I, 2) \subset \cdots$ are closed and
(2) if $I$ is compact and $h < n(0) = 1 + \mu(X)^{-1}$, then $\mathcal{D}(I, h)$ is compact.



3 Exponential linear models 
3.1 Setup and main result 

Let $\mu$ be a Borel measure on $\mathbb{R}$ with $\mu(\mathbb{R}) > 0$. Suppose $I \subset \mathbb{R}$ is a nonempty
open interval and $\{P_t : t \in I\}$ is a family of probability distributions on $\mathbb{R}$, such
that with respect to $\mu$ each $P_t$ has a density

$$p_t(y) = \exp\{ty - \Lambda(t)\}, \quad \text{with } \Lambda(t) = \ln\left[\int e^{ty}\,\mu(dy)\right]. \tag{3.1}$$

As is well known, $\Lambda \in C^\infty(I)$ and for $t \in I$,

$$\mathrm{E}(\xi) = \Lambda'(t), \quad \mathrm{Var}(\xi) = \Lambda''(t) > 0, \quad \text{if } \xi \sim P_t. \tag{3.2}$$

For example, if $\mu = N(0, \sigma^2)$, then $\Lambda(t) = \sigma^2t^2/2$ and $P_t = N(\sigma^2t, \sigma^2)$. If $\mu$ is
the counting measure on $\{0, 1\}$, then $\Lambda(t) = \ln(1 + e^t)$ and $P_t$ is the Bernoulli
distribution with parameter $e^t/(1 + e^t)$. We notice that given $y$, $g(t) := p_t(y)$ can
be analytically extended to the domain $\{z \in \mathbb{C} : \mathrm{Re}(z) \in I\}$. This fact is not needed
in the rest of the section.
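
The identities in (3.2) are easy to verify numerically for the Bernoulli example. The sketch below is an illustration only; the function names and the finite difference step are choices made here. It compares the mean and variance of $P_t$, computed from the density $\exp\{ty - \Lambda(t)\}$ on $\{0, 1\}$, with finite difference approximations of $\Lambda'(t)$ and $\Lambda''(t)$.

```python
import math

def Lambda(t):
    """Log-partition function for the counting measure on {0, 1}."""
    return math.log(1.0 + math.exp(t))

def mean_var(t):
    """Mean and variance of the law with density exp(t*y - Lambda(t)) on {0, 1}."""
    probs = [math.exp(t * y - Lambda(t)) for y in (0, 1)]
    mean = sum(y * p for y, p in zip((0, 1), probs))
    var = sum((y - mean) ** 2 * p for y, p in zip((0, 1), probs))
    return mean, var

def derivatives(t, h=1e-4):
    """Central finite differences for the first two derivatives of Lambda."""
    d1 = (Lambda(t + h) - Lambda(t - h)) / (2 * h)
    d2 = (Lambda(t + h) - 2 * Lambda(t) + Lambda(t - h)) / (h * h)
    return d1, d2

for t in (-1.0, 0.0, 0.5):
    print(t, mean_var(t), derivatives(t))
```

The variance also agrees with the closed form $[2\cosh(t/2)]^{-2}$, which is the quantity bounded from below in the logistic example.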

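Assume that given $X$, $y_1, \ldots, y_n$ are independent, such that each $y_i \sim P_{t_i}$ with
$t_i = X_i^\top\beta$. The joint likelihood of $y_1, \ldots, y_n$ is then

$$\prod_{i=1}^n \exp\{y_iX_i^\top\beta - \Lambda(X_i^\top\beta)\} = \exp\left\{y^\top X\beta - \sum_{i=1}^n \Lambda(X_i^\top\beta)\right\}.$$

From the expression, the $L_0$ regularized MLE for $\beta$ is

$$\hat\beta = \arg\max_{u \in D}\left[y^\top Xu - \sum_{i=1}^n \Lambda(X_i^\top u) - c_r|\mathrm{spt}(u)|\right]. \tag{3.3}$$

If $\beta \in D$, then

$$y^\top X\beta - \sum_{i=1}^n \Lambda(X_i^\top\beta) - c_r|\mathrm{spt}(\beta)| \le y^\top X\hat\beta - \sum_{i=1}^n \Lambda(X_i^\top\hat\beta) - c_r|\mathrm{spt}(\hat\beta)|,$$

and hence

$$\sum_{i=1}^n \left[\Lambda(X_i^\top\hat\beta) - \Lambda(X_i^\top\beta) - \Lambda'(X_i^\top\beta)X_i^\top(\hat\beta - \beta)\right]
\le \langle\epsilon, X\hat\beta - X\beta\rangle - c_r(|\mathrm{spt}(\hat\beta)| - |\mathrm{spt}(\beta)|),$$

where $\epsilon_i = y_i - \mathrm{E}(y_i) = y_i - \Lambda'(X_i^\top\beta)$ has mean 0 for each $i$. It is seen that the
inequality gives rise to (2.2) once we define

$$G(x) = \sum_{i=1}^n x_i, \quad \psi_i(z) = \Lambda(z) - \Lambda'(X_i^\top\beta)z, \quad \varphi_i(z) = z/2, \tag{3.4}$$

for $x \in \mathbb{R}^n$, $z \in \mathbb{R}$ and $1 \le i \le n$.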

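Theorem 3.1 Suppose $\epsilon_1, \ldots, \epsilon_n$ satisfy (2.3) for some $\sigma > 0$. Fix $\nu \in (0, 1)$. Let
$D = \mathcal{D}(I, n(\nu)/2)$ in (3.3), where $n(\nu)$ is defined in (2.4). Suppose

$$\delta := \inf_{t \in I} \Lambda''(t) > 0. \tag{3.5}$$

Fix $q \in (0, 1/2)$. Let

$$c_r = \frac{3\sigma^2\ln(p/q)}{\nu\delta[1 + \mu(X)]} \times \frac{\max_j \|V_j\|_2^2}{\min_j \|V_j\|_2^2}$$

in (3.3). Then, provided $\beta \in D$,

$$\Pr\left\{\|\hat\beta - \beta\|_2 \le \frac{K_r\sqrt{|\mathrm{spt}(\beta)|}}{\sqrt{n}}\right\} \ge 1 - 2q, \tag{3.6}$$

where $K_r = \dfrac{3\sigma\sqrt{2\ln(p/q)}}{\nu\delta[1 + \mu(X)]} \times \dfrac{\sqrt{n}\max_j \|V_j\|_2}{\min_j \|V_j\|_2^2}$.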

3.2 Comments

Some comments on Theorem 3.1 are in order. Many of them also apply to the results
we shall establish later. First, on the constraint $\beta \in \mathcal{D}(I, n(\nu)/2)$. As noted in
Section 2.4, under mild conditions, for $p$ with $\ln p = o(n)$, $n(\nu) \ge c\sqrt{n/\ln p}$. In many
cases, since it is reasonable to assume that $|\mathrm{spt}(\beta)| = O(1)$ (Wasserman & Roeder
2009), the constraint then is very mild.

Second, on $\|\hat\beta - \beta\|_2$, which is determined by $K_r\sqrt{|\mathrm{spt}(\beta)|}/\sqrt{n}$ in (3.6). By (3.6),
$K_r = O(R\sqrt{\ln p})$, where

$$R = \frac{\sqrt{n}\max_j \|V_j\|_2}{\min_j \|V_j\|_2^2} = \frac{\max_j \|V_j\|_2/\sqrt{n}}{\min_j \|V_j\|_2^2/n}.$$

Under mild conditions, $R$ grows very slowly with $n$. For example, $R = 1$ if $X$ is
such that $\|V_j\|_2 = \sqrt{n}$ (recall all $V_j \in \mathbb{R}^n$). We shall see such an example related
to the logistic regression. As another example, suppose all the $np$ entries of $X$ are
i.i.d. $\sim Z$. If $Z$ is bounded, then clearly $\max_j \|V_j\|_2/\sqrt{n} = O(1)$. If $Z \sim N(0, 1)$,
then for any $0 < \eta < 1/2$,

$$\Pr\left\{\max_{1 \le j \le p} \|V_j\|_\infty \le \sqrt{2\ln(np/\eta)}\right\} \ge 1 - 2\eta.$$

Since $\max_j \|V_j\|_2 \le \sqrt{n}\max_j \|V_j\|_\infty$, with high probability, $\max_j \|V_j\|_2/\sqrt{n} =
O(\sqrt{\ln(np)})$. At the same time, given $0 < c < \mathrm{E}(Z^2)$,

$$\Pr\left\{\min_{1 \le j \le p} \|V_j\|_2^2 \le nc\right\} \le p\Pr\{Z_1^2 + \cdots + Z_n^2 \le nc\} \le p\,\psi(c)^n,$$

where $\psi(c) = \inf_{t > 0} \mathrm{E}[e^{t(c - Z^2)}] < 1$. Therefore, for large $n$ and $p$, with high
probability, we have $\max_j \|V_j\|_2/\sqrt{n} = O(\sqrt{\ln(np)})$ or even $O(1)$ on the one hand,
and $\min_j \|V_j\|_2^2/n \ge c$ on the other, provided $\ln p = o(n)$. In particular, suppose
$p = O(n^a)$ for some $a > 0$. Then it is seen that $R = O(\sqrt{\ln n})$ or even $O(1)$, and
hence, by (3.6), with high probability, $\|\hat\beta - \beta\|_2 = O(\ln n/\sqrt{n})$ or $O(\sqrt{\ln p}/\sqrt{n})$.

Finally, the precision also depends on $\delta = \inf_{t \in I} \Lambda''(t)$. To see why $\delta$ matters,
consider the case where $\Lambda''(t)$ is uniformly small in an interval $I$ that contains all of
$X_i^\top\beta$. This implies that $\Lambda'(t)$ has little change on $I$, so by (3.2), $\mathrm{E}(y_1), \ldots, \mathrm{E}(y_n)$ are
close to each other, and at the same time each $y_i$ has little variation. This gives rise
to a nearly "flat" plot of $y_i$ vs $X_i^\top\beta$, which makes the identification of $\beta$ difficult.
That is to say, the precision of the estimate cannot be high. Certainly, if $\Lambda''(t)$
has a wide range on $I$, then using $\inf_{t \in I} \Lambda''(t)$ to set $c_r$ can be quite conservative.
However, as $X_i^\top\beta$ are unknown, it is the only way to account for all the possible
values of $X_i^\top\beta$, including the least ideal one.

3.3 Logistic regression 

Suppose $y_1, \ldots, y_n$ are independent Bernoulli random variables, such that

$$\Pr\{y_i = 1\} = e^{X_i^\top\beta}/(1 + e^{X_i^\top\beta}), \quad i = 1, \ldots, n.$$

The corresponding parametric family of densities is $p_t(y) = \exp\{ty - \Lambda(t)\}$ with
respect to the counting measure on $\{0, 1\}$, with $\Lambda(t) = \ln(1 + e^t)$.

For $i = 1, \ldots, n$, $\epsilon_i = y_i - \Pr\{y_i = 1\} \in (-1, 1)$. Therefore, by Hoeffding's
inequality (Pollard 1984), (2.3) holds with $\sigma = 1$. Given $I \subset \mathbb{R}$, by direct calculation,

$$\inf_{t \in I} \Lambda''(t) = \frac{1}{4\cosh^2(M_I/2)}, \quad \text{with } M_I = \sup_{t \in I} |t|.$$

Given $q \in (0, 1)$, let

$$c_r = \frac{12\ln(p/q)}{\nu[1 + \mu(X)]} \times \frac{\max_j \|V_j\|_2^2}{\min_j \|V_j\|_2^2} \times \cosh^2\frac{M_I}{2}$$

and

$$K_r = \frac{12\sqrt{2\ln(p/q)}}{\nu[1 + \mu(X)]} \times \frac{\sqrt{n}\max_j \|V_j\|_2}{\min_j \|V_j\|_2^2} \times \cosh^2\frac{M_I}{2}.$$
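
To get a feel for the size of these constants, they can be evaluated for a given design matrix. The sketch below is a plain Python illustration; the function name `logistic_constants`, its argument names and the example matrix are assumptions made here. It takes $\sigma = 1$ and $\delta = 1/[4\cosh^2(M_I/2)]$, as in the logistic case above.

```python
import math

def logistic_constants(X, nu, q, M_I):
    """Tuning parameter c_r and constant K_r for the logistic case, computed
    from a design matrix X given as a list of rows."""
    n, p = len(X), len(X[0])
    cols = [list(c) for c in zip(*X)]
    sq = [sum(x * x for x in v) for v in cols]           # squared column norms
    mu = max(abs(sum(a * b for a, b in zip(cols[i], cols[j])))
             / math.sqrt(sq[i] * sq[j])
             for i in range(p) for j in range(i + 1, p))  # coherence of X
    cosh2 = math.cosh(M_I / 2) ** 2
    base = nu * (1 + mu)
    c_r = 12 * math.log(p / q) / base * max(sq) / min(sq) * cosh2
    K_r = (12 * math.sqrt(2 * math.log(p / q)) / base
           * math.sqrt(n) * math.sqrt(max(sq)) / min(sq) * cosh2)
    return c_r, K_r

X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, 1]]
print(logistic_constants(X, nu=0.5, q=0.1, M_I=0.0))
```

Both constants grow like $\cosh^2(M_I/2)$, so a wide interval $I$ makes the bound in (3.6) loose quickly.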

By Theorem 3.1, if $\beta \in \mathcal{D}(I, n(\nu)/2)$, then, with probability at least $1 - 2q$, (3.6) holds
for the estimator

$$\hat\beta = \arg\max\left\{y^\top Xu - \sum_{i=1}^n \ln(1 + e^{X_i^\top u}) - c_r|\mathrm{spt}(u)| : u \in \mathcal{D}(I, n(\nu)/2)\right\}.$$
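
The penalized objective can be made concrete with a brute-force search. The following is a minimal sketch, not the estimator analyzed above: the continuous search domain is replaced by a small finite candidate set (vectors with entries from a hypothetical list `values` and a bounded number of nonzero coordinates), and all names and data are assumptions made for illustration.

```python
import itertools
import math

def penalized_loglik(X, y, u, c_r):
    """y^T X u - sum_i ln(1 + exp(X_i^T u)) - c_r * (number of nonzero u_j)."""
    total = 0.0
    for row, yi in zip(X, y):
        t = sum(x * uj for x, uj in zip(row, u))
        total += yi * t - math.log(1.0 + math.exp(t))
    return total - c_r * sum(1 for uj in u if uj != 0)

def l0_logistic_grid(X, y, c_r, values, max_support):
    """Exhaustive search over vectors whose nonzero entries come from `values`
    (all assumed nonzero) with at most `max_support` nonzero coordinates."""
    p = len(X[0])
    best_u, best_val = None, -math.inf
    for k in range(max_support + 1):
        for support in itertools.combinations(range(p), k):
            for coefs in itertools.product(values, repeat=k):
                u = [0.0] * p
                for j, c in zip(support, coefs):
                    u[j] = c
                val = penalized_loglik(X, y, u, c_r)
                if val > best_val:
                    best_u, best_val = u, val
    return best_u, best_val

X = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]
y = [1, 0, 1, 1]
print(l0_logistic_grid(X, y, c_r=100.0, values=[-1.0, 1.0], max_support=2))
```

With a very large penalty the search returns the zero vector, since every nonzero candidate pays at least `c_r` while the log likelihood is never positive. The cost of the search grows combinatorially in the support size, which reflects the computational limitation of $L_0$ regularization noted in the introduction.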

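If $X$ is binary, i.e., $X_{ij} = 0$ or 1, the result can be somewhat simplified. Let
$\tilde X \in \mathbb{R}^{n \times (p+1)}$ such that $\tilde X_{ij} = 2X_{ij} - 1$ for $j \le p$ and $\tilde X_{i,p+1} = 1$. Also let $\tilde\beta \in \mathbb{R}^{p+1}$
such that $\tilde\beta_j = \beta_j/2$ for $j \le p$ and $\tilde\beta_{p+1} = \sum_{j=1}^p \beta_j/2$. Then $X_i^\top\beta = \tilde X_i^\top\tilde\beta$. Let
$\tilde V_1, \ldots, \tilde V_{p+1}$ be the column vectors of $\tilde X$. Then $\|\tilde V_j\|_2 = \sqrt{n}$. If we regress $y$ on $\tilde X$
to estimate $\tilde\beta$, then

$$c_r = \frac{12\ln[(p+1)/q]}{\nu[1 + \mu(\tilde X)]} \times \cosh^2\frac{M_I}{2}, \quad K_r = \frac{12\sqrt{2\ln[(p+1)/q]}}{\nu[1 + \mu(\tilde X)]} \times \cosh^2\frac{M_I}{2}.$$

In the example, $\mu(\tilde X)$ can be very small. If $X_{ij}$ are i.i.d. with $\Pr\{X_{ij} = 0\} =
\Pr\{X_{ij} = 1\} = 1/2$, then for any $1 \le j < k \le p+1$, $\tilde V_j^\top\tilde V_k \sim \sum_{i=1}^n \eta_i$, where $\eta_i$ are
i.i.d. with $\Pr\{\eta_i = 1\} = \Pr\{\eta_i = -1\} = 1/2$. By Hoeffding's inequality, given $t > 0$,

$$\Pr\left\{\frac{|\tilde V_j^\top\tilde V_k|}{\|\tilde V_j\|_2\|\tilde V_k\|_2} \ge \frac{t}{\sqrt{n}}\right\} = \Pr\left\{\Big|\sum_{i=1}^n \eta_i\Big| \ge t\sqrt{n}\right\} \le 2e^{-t^2/2}.$$

It follows that given $\delta \in (0, 1)$,

$$\Pr\left\{\mu(\tilde X) \ge \sqrt{\frac{2}{n}\ln\frac{(p+1)p}{\delta}}\right\} \le \frac{(p+1)p}{2}\Pr\left\{\frac{|\tilde V_1^\top\tilde V_2|}{\|\tilde V_1\|_2\|\tilde V_2\|_2} \ge \sqrt{\frac{2}{n}\ln\frac{(p+1)p}{\delta}}\right\} \le \delta.$$

Therefore, with high probability, $\mu(\tilde X) = O(\sqrt{\ln p/n})$, which is very small for
reasonably large $p$ and $n$.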

4 Least square regression: preliminaries

4.1 Reformulation and Condition H2

Suppose that, with $X$ fixed,

$$y_i = f(X_i^\top\beta) + \epsilon_i, \quad 1 \le i \le n,$$

where $\epsilon_i$ are independent with mean 0. The $L_0$ regularized LS estimator for $\beta$ is

$$\hat\beta = \arg\min_{u \in D}\,[\|y - f(Xu)\|_2^2 + c_r|\mathrm{spt}(u)|], \tag{4.1}$$

where, as in (3.3), $D$ is a suitable search domain in $\mathbb{R}^p$ and $c_r$ is a regularization
parameter. If $\beta \in D$, then

$$\|y - f(X\hat\beta)\|_2^2 + c_r|\mathrm{spt}(\hat\beta)| \le \|y - f(X\beta)\|_2^2 + c_r|\mathrm{spt}(\beta)|,$$

and hence

$$\|f(X\hat\beta) - f(X\beta)\|_2^2 \le 2\langle\epsilon, f(X\hat\beta) - f(X\beta)\rangle - c_r(|\mathrm{spt}(\hat\beta)| - |\mathrm{spt}(\beta)|),$$

which implies (2.2) once we define

$$G(x) = \|x\|_2^2, \quad \psi_i(z) = \varphi_i(z) = f(z), \tag{4.2}$$

for $x \in \mathbb{R}^n$, $z \in \mathbb{R}$ and $1 \le i \le n$. By Proposition 2.1, all we need to do then is to
find suitable constants $c_1$ and $c_2$ so that Conditions H1 and H2 are satisfied.
For $I \subset \mathbb{R}$ that contains at least two points, denote

$$d(f, I) = \inf\left\{\frac{|f(x) - f(y)|}{|x - y|} : x, y \in I,\ x \ne y\right\}.$$

We start with the easier task of establishing Condition H2.

Proposition 4.1 Let $I \subset \mathbb{R}$ be an interval with positive length. Suppose $f$ is defined
on $I$ with $d(f, I) > 0$. Fix $\nu \in (0, 1)$. Let $D$ in (4.1) be a subset of $\mathcal{D}(I, n(\nu)/2)$. If
$\beta \in D$, then for $G$ and $\psi$ defined as in (4.2), Condition H2 is satisfied with

$$c_2 = \frac{d(f, I)^2\,\nu[1 + \mu(X)]}{n}\min_{1 \le j \le p} \|V_j\|_2^2.$$

As noted in Section 3.2, under mild conditions, for large $n$ and reasonably large
$p$, $c_2 \asymp 1$. Therefore, by Proposition 2.1, in order for the estimate $\hat\beta$ to have some
reasonable precision, the coefficient $c_1$ in Condition H1 has to be of order $o(\sqrt{n})$.
To this end, depending on how well the nonlinear function $f$ behaves, some extra
constraints need to be imposed on the domain $D$. Section 5 is devoted to establishing
Condition H1 for the LS regression. Below we outline the steps to be taken.

4.2 Observations that point to Condition H1

Recall that Condition H1 stipulates an upper bound on $|\langle\epsilon, f(Xu) - f(X\beta)\rangle|$ that
has to hold simultaneously for all $u$. If $f(x) = x$, such a bound is easy to find due
to the conjugate relation $\langle\epsilon, f(Xu) - f(X\beta)\rangle = \langle X^\top\epsilon, u - \beta\rangle$, as it then suffices
to find a bound for $\|X^\top\epsilon\|_\infty$, which can be derived from the tail assumption on $\epsilon$
(Candes & Plan 2009, Zhang 2009). For nonlinear $f$, in general, there are no similar
applicable relations. However, like $e^x/(1 + e^x)$, in many cases, $f$ is analytic and so
we may exploit its power series expansions around different points. By working
with, say, $f(x) = x^2$, one could imagine a kind of power series expansion

$$f(Xu) = \sum_a M_a h_a(u),$$

such that each $M_a$ is some type of (row-wise) monomial transformation of $X$, and
$h_a(u)$ a vector resulting from a similar transformation of $u$. This makes it possible
to rewrite $\langle\epsilon, f(Xu) - f(X\beta)\rangle$ as an infinite sum of $\langle M_a^\top\epsilon, h_a(u) - h_a(\beta)\rangle$, which
could lead to a desirable bound.
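
For $f(x) = x^2$ the expansion has a single term and can be written out explicitly, since $(X_i^\top u)^2 = \sum_{j,k} X_{ij}X_{ik}u_ju_k$. The sketch below is an illustration with made-up data and helper names chosen here. It builds the row-wise monomial matrix and the matching monomial vector, and checks that the inner product with the error vector can be moved onto the transformed design matrix.

```python
import itertools

def monomial_matrix(X):
    """Row-wise degree-2 monomials: row i holds X_ij * X_ik for all pairs (j, k)."""
    p = len(X[0])
    pairs = list(itertools.product(range(p), repeat=2))
    return [[row[j] * row[k] for j, k in pairs] for row in X]

def monomial_vector(u):
    """The products u_j * u_k, listed in the same pair order as monomial_matrix."""
    p = len(u)
    return [u[j] * u[k] for j, k in itertools.product(range(p), repeat=2)]

def inner(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

X = [[1, 2, 0], [0, -1, 3], [2, 1, 1]]
u = [1, -2, 1]
eps = [1, -1, 2]

f_Xu = [inner(row, u) ** 2 for row in X]        # f(x) = x^2 applied coordinate-wise
M = monomial_matrix(X)
Mt_eps = [inner(eps, col) for col in zip(*M)]   # M^T eps
print(inner(eps, f_Xu), inner(Mt_eps, monomial_vector(u)))
```

Both printed numbers agree, so a bound on the sup norm of the transformed error vector controls the inner product uniformly in `u`, just as in the linear case.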

The method works if $f$ is analytic on the entire $\mathbb{C}$, or, more generally, when all
the coordinates of $Xu$ and $X\beta$ fall into the disc of convergence of the power series
expansion of $f$ at 0. On the other hand, when $f$ has poles as $e^x/(1 + e^x)$ does, the
coordinates of $Xu$ and $X\beta$ may fall into different discs of convergence of power series
expansion. Roughly, to deal with this problem, our approach is to cover the line
segment connecting $Xu$ and $X\beta$ with different discs of convergence of power series,
apply the result obtained for the case of a single analytic disc, and patch together
the resulting bounds. This turns out to account for most of the complexity in our
treatment of the analytic case.

One question is whether we can just use a finite Taylor expansion to derive
bounds for $\langle\epsilon, f(Xu) - f(X\beta)\rangle$, thus dispensing with the assumption of analyticity.
The answer seems to be no in general. Unless $f$ is a polynomial, a finite Taylor
expansion of $f(Xu) - f(X\beta)$ has a remainder term of the form $R_a(u)[h_a(u) - h_a(\beta)]$,
where $R_a(u)$ is a matrix that in general depends on $u$. As a result, although for
each individual $u$, we can get a bound for $\langle\epsilon^\top R_a(u), h_a(u) - h_a(\beta)\rangle$ that holds with
high probability, there is no guarantee that, with high probability, the bounds
hold simultaneously for all $u$, which is needed for establishing the precision of $\hat\beta$.

5 Least square regression: continued 
5.1 Setup 

Let $I \subset \mathbb{R}$ be a closed interval with positive length. In this section, we assume
that $f : I \to \mathbb{R}$ is analytic in a neighborhood of $I$, i.e., $f$ has a (unique) analytic
extension onto an open set in $\mathbb{C}$ containing $I$. This is equivalent to saying that
$f \in C^\infty(I)$ and for each $t \in I$, there is $r > 0$, such that

$$\sum_{k=0}^\infty |a_k|r^k < \infty, \quad \text{where } a_k = \frac{f^{(k)}(t)}{k!} \in \mathbb{R},$$

$$\text{and } f(z + t) = \sum_{k=0}^\infty a_kz^k, \quad \text{for all } z \in (-r, r) \text{ with } z + t \in I. \tag{5.1}$$

The radius of convergence of the power series (5.1), henceforth denoted by $\varrho(f, t)$,
can be determined by (Rudin 1987)

$$\varrho(f, t) = \left(\limsup_{k \to \infty} |a_k|^{1/k}\right)^{-1}.$$

If $|z| < \varrho(f, t)$, then we say $f(z + t)$ has a convergent power series expansion at $t$.
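
The root formula can be tried out on the logistic function $e^x/(1 + e^x) = 1/(1 + e^{-x})$, whose nearest singularities to 0 are at $\pm\pi\mathrm{i}$, so that the radius of convergence at 0 is $\pi$. The sketch below is an illustration only: the helper names are chosen here, and a single large odd index stands in for the limit superior (the even coefficients beyond $a_0$ vanish). The Taylor coefficients are computed exactly with rational arithmetic by inverting the series of $1 + e^{-x}$.

```python
from fractions import Fraction
from math import factorial, pi

def logistic_taylor(K):
    """Exact Taylor coefficients a_0, ..., a_K of 1/(1 + e^{-x}) at 0,
    obtained as the reciprocal of the power series of 1 + e^{-x}."""
    g = [Fraction((-1) ** k, factorial(k)) for k in range(K + 1)]
    g[0] += 1                                   # series of 1 + e^{-x}
    a = [Fraction(1) / g[0]]
    for k in range(1, K + 1):
        a.append(-sum(g[j] * a[k - j] for j in range(1, k + 1)) / g[0])
    return a

def root_test_radius(a, k):
    """|a_k|^(-1/k), a finite-k proxy for the limsup formula."""
    return float(abs(a[k])) ** (-1.0 / k)

a = logistic_taylor(41)
print(root_test_radius(a, 41), pi)
```

The finite-index estimate is slightly above $\pi$ and approaches it slowly as the index grows, which is typical of the root test.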
We will regularly use the following weighted $L_1$ norm

$$\|u\|_{1,s} = \sum_{j=1}^p |u_j|\|V_j\|_s, \quad u \in \mathbb{R}^p,\ s \ge 1. \tag{5.2}$$

Recall that it is assumed from the beginning that $V_j \ne 0$ for all $j$. Therefore, $\|u\|_{1,s}$
is indeed a norm. Finally, if $(\mathcal{L}, \|\cdot\|)$ is a normed linear space, then denote by

$$B(u, a; \|\cdot\|) = \{v \in \mathcal{L} : \|v - u\| \le a\}$$

the sphere centered at $u \in \mathcal{L}$ with radius $a > 0$ under the norm $\|\cdot\|$, and by

$$\delta(E; \|\cdot\|) = \inf\{a : E \subset B(u, a; \|\cdot\|) \text{ for some } u\}$$

the infimum of the radii of spheres under the norm $\|\cdot\|$ that contain $E \subset \mathcal{L}$.



5.2 Single analytic disc 

We first consider the case where all $f(X_1^\top u), \ldots, f(X_n^\top u)$ have convergent power
series expansions at 0. The main result of this section is as follows.

Theorem 5.1 Suppose $0 \in I$ and $d(f, I) > 0$. Fix $\nu \in (0, 1)$ and $\theta \in (0, 1)$.
Suppose

$$D = \mathcal{D}(I, n(\nu)/2) \cap \{u \in \mathbb{R}^p : \|u\|_{1,\infty} \le \theta\varrho(f, 0)/2\}$$

in (4.1) and $\epsilon$ satisfies (2.3) for $\sigma > 0$. Given $q \in (0, 1)$, let $\lambda_p = \ln[p(1 + q^{-1})]$. If
$\beta \in D$, then the conclusion of Proposition 2.1 holds with

$$c_1 = \sigma\sqrt{2\lambda_p}\sum_{k=1}^\infty \frac{\sqrt{k}\,|f^{(k)}(0)|}{(k-1)!}[\theta\varrho(f, 0)]^{k-1} \times n^{-1/2k}\max_{1 \le j \le p} \|V_j\|_{2k}$$

and $c_2$ as in Proposition 4.1.

If $f$ is linear, then the expression of $c_1$ is simplified into

$$c_1 = \sigma\sqrt{2\lambda_p}\,|f'(0)|\max_{1 \le j \le p} \|V_j\|_2/\sqrt{n}.$$

In the general case, as $n^{-1/2k}\max_j \|V_j\|_{2k} \le \max_j \|V_j\|_\infty$,

$$c_1 \le \sigma\sqrt{2\lambda_p}\,K\max_{1 \le j \le p} \|V_j\|_\infty, \quad \text{with } K = \sum_{k=1}^\infty \frac{\sqrt{k}\,|f^{(k)}(0)|}{(k-1)!}[\theta\varrho(f, 0)]^{k-1}.$$

Since $\varrho(f, 0) = (\limsup_k |f^{(k)}(0)/k!|^{1/k})^{-1}$, it is easy to see that $c_1 < \infty$. As noted in
Section 3.2, under mild conditions, $\max_j \|V_j\|_\infty = O(\sqrt{\ln(np)})$. Since $\lambda_p = O(\ln p)$
and $K$ is a constant, $c_1 = O(\sqrt{\ln(np)\ln p})$. Therefore, for reasonably large $p$, such
as $p = n^a$, $c_1 = O(\ln n)$. Moreover, as seen previously, under mild conditions, it
is possible that $c_1 = O(\sqrt{\ln n})$. Combining this with the comment after Proposition 4.1, it is
seen that the regression estimator (4.1) can have good precision.

5.3 Multiple analytic discs 

We first need some preparation. Let $\mathcal{N} \subset \mathbb{C}$ be an open set containing $I$ such that
$f$ has an analytic extension on $\mathcal{N}$. Let $J = \mathcal{N} \cap \mathbb{R}$. For $u \in \mathcal{D}(J)$, $i = 1, \ldots, n$, and
$k \in \mathbb{N}$, define functions

$$a_{ik}(u) = \frac{f^{(k)}(X_i^\top u)}{k!}, \quad A_k(u) = \max_{1 \le i \le n} |a_{ik}(u)|, \quad r(u) = \min_{1 \le i \le n} \varrho(f, X_i^\top u). \tag{5.3}$$

It is easy to see that $r(u) > 0$. Given any function $b(u)$ on $\mathcal{D}(J)$ satisfying

$$0 < b(u) \le r(u) \tag{5.4}$$

and given any set $E \subset \mathcal{D}(J)$, denote

$$b(E) = \inf_{u \in E} b(u), \quad r(E) = \inf_{u \in E} r(u), \quad A_k(E) = \sup_{u \in E} A_k(u). \tag{5.5}$$

If $E$ is finite, then it is easy to see that $r(E) \ge b(E)$, and, by $\limsup_k |a_{ik}(u)|^{1/k} =
1/\varrho(f, X_i^\top u)$ for $u \in \mathcal{D}(J)$ and $i = 1, \ldots, n$,

$$\limsup_{k \to \infty} A_k(E)^{1/k} = \max_{u \in E,\ 1 \le i \le n}\,\limsup_{k \to \infty} |a_{ik}(u)|^{1/k} = \frac{1}{r(E)}. \tag{5.6}$$

Let $G$ be a subset of $\mathcal{D}(J)$. If

$$E \subset \bigcup_{u \in G} O_u, \quad \text{with } O_u = B(u, b(u)/2; \|\cdot\|_{1,\infty}), \tag{5.7}$$

then $G$ will be referred to as a "$b/2$-covering grid", or simply "covering grid", for $E$.
By this definition, for each point $u$ in a covering grid and $i = 1, \ldots, n$, $f$ is analytic
at $X_i^\top u$ with $\varrho(f, X_i^\top u) \ge b(u)$. Note that a covering grid of $E$ need not be its
subset. If $E$ is compact, it always has a finite covering grid.
Finally, for $E \subset \mathbb{R}^p$, denote

$$C(E) = \{(1 - s)u + sv : s \in [0, 1],\ u, v \in E\},$$

i.e., the union of all the line segments connecting pairs of points in $E$. If $E$ is
bounded (resp. compact), then $C(E)$ is bounded (resp. compact). If $|\mathrm{spt}(u)| \le a$
for every $u \in E$, then $|\mathrm{spt}(v)| \le 2a$ for every $v \in C(E)$. However, $C(E)$ may not be
convex, and for unbounded closed $E$, $C(E)$ may not even be closed.
After all the preparation, the main result can be stated as follows.

Theorem 5.2 Suppose $I$ is compact and $d(f, I) > 0$. Fix $\nu \in (0, 1)$. In the
regression (4.1), let $D$ be a closed subset of $\mathcal{D}(I, n(\nu)/2)$. Fix $b(u)$ satisfying (5.4).
Let $G$ be a finite $b/2$-covering grid of $C(D)$. Given $q \in (0, 1)$, let $\lambda_p = \ln[p(1 + q^{-1})]$.
If $\beta \in D$, then the conclusion of Proposition 2.1 holds with

$$c_1 = \sqrt{2}\sigma\sum_{k=1}^\infty k\sqrt{\ln|G| + k\lambda_p}\,A_k(G)b(G)^{k-1} \times n^{-1/2k}\max_{1 \le j \le p} \|V_j\|_{2k} \tag{5.8}$$

and $c_2$ as in Proposition 4.1.

To get $c_1$, it is enough to assume $D$ is a compact subset of $\mathcal{D}(J)$. The stronger
assumption that $D \subset \mathcal{D}(I, n(\nu)/2)$ is needed in order to get both $c_1$ and $c_2$. By
Proposition 2.3, $\mathcal{D}(I, n(\nu)/2)$ is compact. Therefore, if $D \subset \mathcal{D}(I, n(\nu)/2)$ is closed,
it is compact as well.

Unlike in Theorem 5.1, here $c_1$ depends on $|G|$. In order for the regression
estimator (4.1) to have good precision, $|G|$ has to be controlled. The smaller $|G|$ is,
the higher the precision we can claim for $\hat\beta$. To see what might be an acceptable
level of $|G|$, observe that

$$c_1 \le \sqrt{2}\sigma K\sqrt{\ln|G| + \lambda_p}\max_{1 \le j \le p} \|V_j\|_\infty = O\left(\sqrt{\ln(p|G|)}\max_j \|V_j\|_\infty\right),$$

where $K = \sum_{k=1}^\infty k^{3/2}A_k(G)b(G)^{k-1}$ is finite by (5.6). From the comment after Proposition
4.1, it is seen that $\hat\beta$ has good precision if $\sqrt{\ln(p|G|)}\max_j \|V_j\|_\infty = o(\sqrt{n})$.
Provided $\max_j \|V_j\|_\infty = O(\sqrt{\ln(np)})$ and $p = n^a$, this implies there should be
$\ln|G| = o(n/\ln n)$. Certainly, $|G|$ depends on the choice of the search domain $D$ in
(4.1) and the property of $f$. We next get some upper bounds of $|G|$.



5.4 Upper bounds on the cardinality of covering grid 

We follow the notation in Section 5.3. Recall that $f$ is analytic on some open domain
$\mathcal{N} \subset \mathbb{C}$ containing $I = [a, b]$ and $J = \mathcal{N} \cap \mathbb{R}$. The next result says that $|G|$ can be as
small as 1 in Theorem 5.2. It follows directly from the definition of covering grid.

Proposition 5.3 Let $D \subset B(w, d/2; \|\cdot\|_{1,\infty})$ for some $w \in \mathcal{D}(J)$ and $0 < d \le r(w)$.
Then for any $b$ satisfying (5.4) and $d \le b(w)$, $\{w\}$ is a $b/2$-covering grid for $C(D)$.

As an example, if $f$ is analytic in a neighborhood of 0 and $\|u\|_{1,\infty} \le \theta\varrho(f, 0)/2$
for all $u \in D$, where $0 < \theta < 1/2$, then, since $r(0) = \varrho(f, 0)$, $\{0\}$ is a $b/2$-covering
grid of $C(D)$ for any $b$ satisfying (5.4) with $b(0) \ge \theta\varrho(f, 0)$.
We next consider more general cases. For ease of notation, for $E \subset \mathbb{R}^p$ and
$S \subset \{1, \ldots, p\}$, denote $\delta(E) = \delta(E; \|\cdot\|_{1,\infty})$ and $E_S = \{u \in E : \mathrm{spt}(u) \subset S\}$.

Proposition 5.4 Fix $b(u)$ satisfying (5.4) and $h \in \mathbb{N}$. Let $D \subset \mathcal{D}(I, h/2)$ be
compact and $K = C(D)$.

(1) If $J = \mathbb{R}$ and $d_b := \inf_{u \in \mathbb{R}^p} b(u) > 0$, then $K$ has a $b/2$-covering grid with
cardinality no greater than

$$\sum_{|S| = h:\ K_S \ne \emptyset} [2\delta(K_S)/d_b + 1]^h \le \binom{p}{h}[2\delta(D)/d_b + 1]^h.$$

(2) In general, if $d_b := \inf_{u \in \mathcal{D}(I, h)} b(u) > 0$, then $K$ has a $b/2$-covering grid with
cardinality no greater than

$$\sum_{|S| = h:\ K_S \ne \emptyset} [4\delta(K_S)/d_b + 1]^h \le \binom{p}{h}[4\delta(D)/d_b + 1]^h.$$

Note that, since $I$ is compact, $\inf_{u \in \mathcal{D}(I)} r(u) \ge \inf_{x \in I} \varrho(f, x) > 0$, so there are
always functions $b(u)$ satisfying (5.4) and $d_b > 0$. For example, $b(u) = r(u)/2$.
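
The right-hand sides of these bounds are simple to evaluate, and their logarithm is the quantity that enters $c_1$ through $\ln|G|$. The sketch below is an illustration; the function and parameter names are assumptions made here, with `general=True` corresponding to the constant 4 of part (2) and `general=False` to the constant 2 of part (1).

```python
import math

def grid_bound(p, h, delta_D, d_b, general=True):
    """Upper bound C(p, h) * [c * delta(D) / d_b + 1]^h on the size of a covering grid,
    with c = 4 in the general case and c = 2 when J is the whole real line."""
    c = 4 if general else 2
    return math.comb(p, h) * (c * delta_D / d_b + 1) ** h

def log_grid_bound(p, h, delta_D, d_b, general=True):
    """Logarithm of the same bound, computed without forming the large product."""
    c = 4 if general else 2
    return math.log(math.comb(p, h)) + h * math.log(c * delta_D / d_b + 1)

print(grid_bound(5, 2, 1.0, 2.0, general=False), grid_bound(5, 2, 1.0, 2.0))
print(log_grid_bound(1000, 10, 1.0, 0.5))
```

Since $\binom{p}{h} \le p^h$, the logarithm is at most $h\ln(pQ)$ with $Q$ the bracketed factor, so the bound grows linearly in the sparsity level $h$ and only logarithmically in $p$.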

Finally, in Theorem 5.2, $c_1$ depends on the choice of $G$, so it may not be easy
to use. Using the above bounds on $|G|$, we have some more convenient choices for
$c_1$, although they are larger than the one in (5.8).

Proposition 5.5 Let $D$ be a compact subset of $\mathcal{D}(I, h/2)$ in regression (4.1).

(1) Let $d_k = \sup_{x \in J} |f^{(k)}(x)|/k!$ and $\varrho_0 = \inf_{x \in J} \varrho(f, x)$. Suppose $J = \mathbb{R}$, $\varrho_0 > 0$,
and for any $\varrho_1 \in (0, \varrho_0)$, $\sup_{|\mathrm{Im}(z)| \le \varrho_1} |f'(z)| < \infty$. Then the radius of convergence
of $\sum_{k \ge 1} d_kz^k$ is $\varrho_0$ and given $\varrho_1 \in (0, \varrho_0)$, $c_1$ in (5.8) can be set equal to

$$c_1 = \sqrt{2}\sigma\sum_{k=1}^\infty k\sqrt{h\ln(pQ) + k\lambda_p}\,d_k\varrho_1^{k-1} \times n^{-1/2k}\max_{1 \le j \le p} \|V_j\|_{2k}, \tag{5.9}$$

where $Q = 2\delta(D)/\varrho_1 + 1$.

(2) Let $d_k = \sup_{x \in I} |f^{(k)}(x)|/k!$ and $\varrho_0 = \inf_{x \in I} \varrho(f, x)$. Then $\varrho_0 > 0$ is equal to
the radius of convergence of $\sum_{k \ge 1} d_kz^k$, and given $\varrho_1 \in (0, \varrho_0)$, $c_1$ in (5.8) can be
set equal to

$$c_1 = \sqrt{2}\sigma\sum_{k=1}^\infty k\sqrt{h\ln(pQ) + k\lambda_p}\,d_k\varrho_1^{k-1} \times n^{-1/2k}\max_{1 \le j \le p} \|V_j\|_{2k}, \tag{5.10}$$

where $Q = 4\delta(D)/\varrho_1 + 1$.

In (5.9), because the radius of convergence of $\sum_{k \ge 1} d_kz^k$ is $\varrho_0$, $c_1 < \infty$. As
$\lambda_p = \ln[p(1 + q^{-1})]$, $c_1 = O(\sqrt{h\ln p}\max_j \|V_j\|_\infty)$. Therefore, under mild conditions,
for large $n$, as long as $h$ is not too large, the regression (4.1) still has good precision.

5.5 Logistic regression with binary noise 

Let $y_1, \dots, y_n$ be the same random variables as in Section 3.3. However, we only see their randomly "flipped" versions $z_1, \dots, z_n \in \{0, 1\}$, such that

$$\Pr\{z_1, \dots, z_n \mid y_1, \dots, y_n\} = \prod_{i=1}^{n} p_{y_i z_i},$$

where $p_{ab} \ge 0$ and $p_{a0} + p_{a1} = 1$ for $a = 0, 1$. Suppose all $p_{ab}$ are known. The regression model now is $E(z_i) = f(X_i^T \beta)$ with

$$f(t) = \frac{p_{01} + p_{11} e^{t}}{1 + e^{t}}.$$

If $p_{01} = p_{11}$, then $z_i$ is independent of $y_i$ with $\Pr\{z_i = 1\} = p_{11}$, making inference impossible. Therefore, we will assume $\Delta_p = |p_{11} - p_{01}| > 0$.

Since $f$ is analytic on $\mathbb{C} \setminus \{t_k, k \in \mathbb{Z}\}$, where $t_k = (2k + 1)\pi i$, we shall apply Proposition 5.5(1). First, since $\epsilon_i = z_i - E(z_i)$ are independent and $|\epsilon_i| \le 1$, they satisfy the tail assumption (2.3) with $\sigma = 1$. Since $\varrho(f, x)$ is the distance from $x$ to the closest pole for any $x \in \mathbb{R}$, $\varrho_0 = |0 - t_0| = \pi$. Simple calculation gives $f'(t) = (p_{11} - p_{01})[2\cosh(t/2)]^{-2}$. By $2|\cosh(a + bi)| \ge e^{|a|} - e^{-|a|}$ for $a, b \in \mathbb{R}$, it is easy to see that for $y \in (0, \pi)$,

$$M(y) := \sup_{|\mathrm{Im}\, z| \le y} |2\cosh(z/2)|^{-2} < \infty.$$
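The formulas of this subsection can be checked numerically. The sketch below assumes that Section 3.3 uses the standard logistic model $\Pr\{y_i = 1\} = e^{t}/(1 + e^{t})$ with $t = X_i^T\beta$; the function names are illustrative only. The closed form returned by `strip_sup` is our own calculation (from $|\cosh(a + bi)|^2 = \sinh^2 a + \cos^2 b$) and is not stated in the text.

```python
import cmath
import math

def flipped_mean(t, p01, p11):
    """E(z) = Pr{z = 1} when Pr{y = 1} = e^t / (1 + e^t) and Pr{z = 1 | y = a} = p_{a1}."""
    s = 1.0 / (1.0 + math.exp(-t))          # logistic probability of y = 1 (assumed model)
    return p01 * (1.0 - s) + p11 * s

def flipped_mean_deriv(t, p01, p11):
    """Closed form f'(t) = (p11 - p01) [2 cosh(t/2)]^{-2}."""
    return (p11 - p01) / (2.0 * math.cosh(t / 2.0)) ** 2

def kernel(z):
    """|2 cosh(z/2)|^{-2} for complex z."""
    return abs(2.0 * cmath.cosh(z / 2.0)) ** -2

def strip_sup(y):
    """Candidate value of M(y) for 0 < y < pi (our own calculation, attained at z = iy)."""
    return 1.0 / (2.0 * math.cos(y / 2.0)) ** 2
```

Comparing `flipped_mean_deriv` with a finite difference of `flipped_mean`, and `kernel` on a grid of the strip with `strip_sup`, gives a quick sanity check of $f'$ and of the finiteness of $M(y)$.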

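Fix $\varrho_1 \in (0, \pi)$, $r > 0$ and $\nu \in (0, 1)$. Let $I = [-r, r]$ and

$$D = \{u \in \mathbb{R}^p : \|u\|_{1,\infty} \le r,\ |\mathrm{spt}(u)| \le n(\nu)/2\}.$$

Apparently, $D \subset \mathcal{D}(I, n(\nu)/2)$ and $\delta(D) \le r$, where, as in Proposition 5.5, $\delta(D) = \delta(D; \|\cdot\|_{1,\infty})$.

Let $\theta \in (\varrho_1/\pi, 1)$. For any $x \in \mathbb{R}$ and $k \ge 1$, by Cauchy's contour integral,

$$\frac{|f^{(k)}(x)|}{k!} \le \frac{1}{2k\pi} \oint_{|z - x| = \varrho_1/\theta} \frac{|f'(z)|\,|dz|}{(\varrho_1/\theta)^{k}} \le \frac{\Delta_p M(\varrho_1/\theta)}{k(\varrho_1/\theta)^{k-1}},$$

giving $d_k \le \Delta_p M(\varrho_1/\theta)/[k(\varrho_1/\theta)^{k-1}]$. Therefore, by Proposition 5.5(1),

$$c_1 \le \sqrt{2}\,\Delta_p M(\varrho_1/\theta) \sum_{k=1}^{\infty} \sqrt{R + k\lambda_p}\; \theta^{k-1} \times n^{-\frac{1}{2k}} \max_{1 \le j \le p} \|V_j\|_{2k},$$

where $R = n(\nu)/2 \times \ln(2rp/\varrho_1 + p)$. On the other hand, given $r > 0$,

$$m(r) := \inf_{x \in [-r, r]} |2\cosh(x/2)|^{-2} > 0.$$

Therefore, by Proposition 4.1,

$$c_2 \ge \frac{\Delta_p^2\, m(r)^2\, \nu[1 + \mu(X)]}{n} \min_{1 \le j \le p} \|V_j\|_2^2.$$

Similar to Section 3.3, if all the entries of $X$ are $\pm 1$, then the results can be simplified so that $D = \{u \in \mathbb{R}^p : \|u\|_1 \le r,\ |\mathrm{spt}(u)| \le n(\nu)/2\}$, and

$$c_1 \le \sqrt{2}\,\Delta_p M(\varrho_1/\theta) \sum_{k=1}^{\infty} \sqrt{R + k\lambda_p}\; \theta^{k-1}, \qquad c_2 \ge \Delta_p^2\, m(r)^2\, \nu[1 + \mu(X)].$$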
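As a rough illustration of how the two constants behave for a $\pm 1$ design, the sketch below evaluates the bounds $c_1 \le \sqrt{2}\,\Delta_p M(\varrho_1/\theta) \sum_k \sqrt{R + k\lambda_p}\,\theta^{k-1}$ and $c_2 \ge \Delta_p^2 m(r)^2 \nu[1 + \mu(X)]$ with the series truncated. It assumes $n(\nu) = (1 - \nu)[1 + 1/\mu(X)]$ and $\lambda_p = \ln[p(1 + q^{-1})]$ as used elsewhere in the paper, and it assumes the closed forms $M(y) = [2\cos(y/2)]^{-2}$ and $m(r) = [2\cosh(r/2)]^{-2}$, which are our own calculations. The function name and its arguments are hypothetical.

```python
import math

def pm1_design_constants(p, q, r, nu, mu, rho1, theta, p01, p11, terms=2000):
    """Truncated upper bound on c1 and lower bound on c2 for a +-1 design (sketch)."""
    assert 0.0 < rho1 < math.pi and rho1 / math.pi < theta < 1.0
    delta_p = abs(p11 - p01)
    lam_p = math.log(p * (1.0 + 1.0 / q))                    # lambda_p = ln[p(1 + 1/q)]
    n_nu = (1.0 - nu) * (1.0 + 1.0 / mu)                     # n(nu), assumed definition
    big_r = n_nu / 2.0 * math.log(2.0 * r * p / rho1 + p)    # R in the text
    m_sup = 1.0 / (2.0 * math.cos(rho1 / (2.0 * theta))) ** 2  # assumed closed form of M(rho1/theta)
    m_inf = 1.0 / (2.0 * math.cosh(r / 2.0)) ** 2            # m(r), attained at x = +-r
    series = sum(math.sqrt(big_r + k * lam_p) * theta ** (k - 1)
                 for k in range(1, terms + 1))
    c1_upper = math.sqrt(2.0) * delta_p * m_sup * series
    c2_lower = delta_p ** 2 * m_inf ** 2 * nu * (1.0 + mu)
    return c1_upper, c2_lower
```

Because the summand decays geometrically in $\theta$, a few thousand terms are far more than needed for moderate $\theta$; the truncation level is exposed only so that this can be checked.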

6 Technical details 
6.1 Preliminary results 

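Proof of Proposition 2.2. (1) Let $S = \mathrm{spt}(u)$. If $|S| = 0$, then $u = 0$ and the inequality trivially holds. Suppose $|S| \ge 1$. Since $Xu = \sum_{j \in S} u_j V_j$,

$$\begin{aligned}
\|Xu\|_2^2 &= \sum_{j \in S} |u_j|^2 \|V_j\|_2^2 + \sum_{i \ne j \in S} u_i u_j V_i^T V_j \\
&\ge \sum_{j \in S} |u_j|^2 \|V_j\|_2^2 - \mu(X) \sum_{i \ne j \in S} |u_i||u_j| \|V_i\|_2 \|V_j\|_2 \\
&= [1 + \mu(X)] \sum_{j \in S} |u_j|^2 \|V_j\|_2^2 - \mu(X) \Big( \sum_{j \in S} |u_j| \|V_j\|_2 \Big)^2.
\end{aligned}$$

By the Cauchy-Schwarz inequality,

$$\|Xu\|_2^2 \ge (1 + \mu(X) - \mu(X)|S|) \sum_{j \in S} |u_j|^2 \|V_j\|_2^2.$$

Since $|S| \le n(\nu) = (1 - \nu)[1 + 1/\mu(X)]$, then $1 + \mu(X) - \mu(X)|S| \ge \nu[1 + \mu(X)]$, which implies the desired inequality.

(2) By $\mathrm{spt}(u - v) \subset \mathrm{spt}(u) \cup \mathrm{spt}(v)$ and the assumption, $|\mathrm{spt}(u - v)| \le n(\nu)$. The inequality then follows from (1). □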
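The inequality in part (1) can be checked numerically on a small random design. The sketch below assumes that $\mu(X)$ is the mutual coherence $\max_{i \ne j} |V_i^T V_j| / (\|V_i\|_2 \|V_j\|_2)$, which is how it is used in the proof; the helper names are illustrative only. For a support of size $|S|$, it takes the largest $\nu$ with $|S| \le n(\nu)$, namely $\nu = 1 - |S|\mu(X)/[1 + \mu(X)]$.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coherence(cols):
    """Mutual coherence of the columns (assumed meaning of mu(X))."""
    norms = [math.sqrt(dot(v, v)) for v in cols]
    return max(abs(dot(cols[i], cols[j])) / (norms[i] * norms[j])
               for i in range(len(cols)) for j in range(i + 1, len(cols)))

def lower_bound_sides(cols, u, nu):
    """Return (||Xu||_2^2, nu [1 + mu(X)] sum_j u_j^2 ||V_j||_2^2)."""
    n = len(cols[0])
    xu = [sum(u[j] * cols[j][i] for j in range(len(cols))) for i in range(n)]
    mu = coherence(cols)
    rhs = nu * (1.0 + mu) * sum(u[j] ** 2 * dot(cols[j], cols[j])
                                for j in range(len(cols)))
    return dot(xu, xu), rhs

random.seed(0)
n, p = 200, 6
cols = [[random.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(p)]
mu = coherence(cols)
u = [0.0, 1.5, 0.0, 0.0, -0.7, 0.0]            # support of size 2
nu = 1.0 - 2.0 * mu / (1.0 + mu)               # largest nu with |S| <= n(nu)
lhs, rhs = lower_bound_sides(cols, u, nu)
```

With these choices `lhs >= rhs` should hold for any design whose coherence is below 1, which is what the proposition asserts.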

Proof of Proposition 2.3. (1) Because $I$ is closed and the mapping $T : u \to Xu$ is continuous, $\mathcal{D}(I) = T^{-1}(I^n)$ is closed. Also, $\mathcal{V}_h := \{u \in \mathbb{R}^p : |\mathrm{spt}(u)| \le h\}$ is closed. Thus $\mathcal{D}(I, h) = \mathcal{D}(I) \cap \mathcal{V}_h$ is closed. It is easy to see that $\mathcal{D}(I, h) \subset \mathcal{D}(I, h')$ when $h \le h'$.

(2) Because of (1), to show that $\mathcal{D}(I, h)$ is compact for $h < n(0)$, it suffices to show the set is bounded. Since $h < n(0)$, there is $\nu \in (0, 1)$ such that $h \le n(\nu)$. Let $u \in \mathcal{D}(I, h)$. Then $|\mathrm{spt}(u)| \le n(\nu)$, so by Proposition 2.2,

$$\|u\|_2^2 \le \frac{\|Xu\|_2^2}{\nu(1 + \mu(X)) \min_{1 \le j \le p} \|V_j\|_2^2}.$$

Since $X_i^T u \in I$ for each $i$, then $\|Xu\|_2^2 \le n \max_i |X_i^T u|^2 \le n \sup_{x \in I} |x|^2$. Because $I$ is bounded, it is seen that $\|u\|_2$ is bounded for $u \in \mathcal{D}(I, h)$. □



6.2 Exponential linear models 

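In this section, we prove the next two lemmas.

Lemma 6.1 Condition H1 is satisfied by $\varphi = (\varphi_1, \dots, \varphi_n)$ with

$$c_1 = \sigma \sqrt{\frac{\ln(p/q)}{2n}} \max_{1 \le j \le p} \|V_j\|_2. \tag{6.1}$$

Lemma 6.2 Condition H2 is satisfied by $G$ and $\psi = (\psi_1, \dots, \psi_n)$ with

$$c_2 = \frac{\delta\nu[1 + \mu(X)]}{2n} \min_{1 \le j \le p} \|V_j\|_2^2. \tag{6.2}$$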

By Proposition 2.1, if c r = 3c 2 /c2 in (3.3), then (3.6) holds with k t = 2>c\/c2- 
Therefore, once the lemmas are proved, we get the expressions of c r and K r as in 
Theorem 3.1. 

As in (3.4), let $G(x) = x_1 + \cdots + x_n$ for $x \in \mathbb{R}^n$, and $\varphi_i(z) = z/2$, $\psi_i(z) = A(z) - A'(X_i^T \beta) z$ for $1 \le i \le n$ and $z \in \mathbb{R}$.

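Proof of Lemma 6.1. By (2.3) and $\epsilon^T V_j = \sum_{i=1}^{n} X_{ij} \epsilon_i$,

$$\Pr\big\{|\epsilon^T V_j| \le \sqrt{2\ln(p/q)}\,\sigma \|V_j\|_2,\ \text{all } j = 1, \dots, p\big\} \ge 1 - \sum_{j=1}^{p} \Pr\big\{|\epsilon^T V_j|^2 > 2\ln(p/q)\,\sigma^2 \|V_j\|_2^2\big\} \ge 1 - 2q.$$

Consequently, with probability at least $1 - 2q$,

$$\|X^T \epsilon\|_\infty = \max_{1 \le j \le p} |\epsilon^T V_j| \le \sqrt{2\ln(p/q)}\,\sigma \max_{1 \le j \le p} \|V_j\|_2 = 2c_1\sqrt{n},$$

which implies Condition H1 due to the fact that for all $u \in \mathbb{R}^p$,

$$|\langle \epsilon, \varphi(Xu) - \varphi(X\beta) \rangle| = \tfrac{1}{2} |\langle \epsilon, Xu - X\beta \rangle| = \tfrac{1}{2} |(X^T \epsilon)^T (u - \beta)| \le \tfrac{1}{2} \|X^T \epsilon\|_\infty \|u - \beta\|_1. \quad □$$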
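For a concrete design, the constant $c_1 = \sigma\sqrt{\ln(p/q)/(2n)}\,\max_j \|V_j\|_2$ used in this proof is easy to evaluate. The following is a minimal sketch; the function name and the column-list representation of $X$ are our own choices, not part of the paper.

```python
import math

def c1_exponential_model(columns, sigma, q):
    """c1 = sigma * sqrt(ln(p/q) / (2n)) * max_j ||V_j||_2 for a design given by its columns."""
    p = len(columns)
    n = len(columns[0])
    max_norm = max(math.sqrt(sum(x * x for x in col)) for col in columns)
    return sigma * math.sqrt(math.log(p / q) / (2.0 * n)) * max_norm

# For a design with +-1 entries every column has norm sqrt(n),
# so the value reduces to sigma * sqrt(ln(p/q) / 2), independent of n.
design = [[1.0, -1.0, 1.0, 1.0], [-1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, -1.0]]
c1 = c1_exponential_model(design, sigma=1.0, q=0.05)
```

The identity $2c_1\sqrt{n} = \sqrt{2\ln(p/q)}\,\sigma\max_j\|V_j\|_2$ from the proof is a direct rearrangement of this definition.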
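Proof of Lemma 6.2. Given $u \in \mathcal{D}(I, n(\nu)/2)$, for $t \in [0, 1]$, let

$$h(t) = \sum_{i=1}^{n} \psi_i\big((1 - t) X_i^T \beta + t X_i^T u\big),$$

which is well-defined as $(1 - t) X_i^T \beta + t X_i^T u \in I$. Let $\Delta = G(\psi(Xu) - \psi(X\beta))$. Then

$$\Delta = \sum_{i=1}^{n} \big[\psi_i(X_i^T u) - \psi_i(X_i^T \beta)\big] = h(1) - h(0).$$

Observe that $\psi_i'(X_i^T \beta) = 0$. Then $h'(0) = \sum_i X_i^T (u - \beta)\, \psi_i'(X_i^T \beta) = 0$, so by Taylor expansion, $\Delta = h''(\tau)/2$ for some $\tau \in (0, 1)$. By $\psi_i''(z) = A''(z)$ and $\inf_{t \in I} A''(t) = \delta > 0$,

$$\Delta = \frac{1}{2} \sum_{i=1}^{n} \psi_i''\big((1 - \tau) X_i^T \beta + \tau X_i^T u\big) [X_i^T (u - \beta)]^2 \ge \frac{\delta}{2} \sum_{i=1}^{n} [X_i^T (u - \beta)]^2 = \frac{\delta}{2} \|X(u - \beta)\|_2^2.$$

By $|\mathrm{spt}(u - \beta)| \le |\mathrm{spt}(u) \cup \mathrm{spt}(\beta)| \le n(\nu)$ and Proposition 2.2,

$$\Delta \ge \frac{\delta\nu[1 + \mu(X)]}{2} \sum_{j=1}^{p} |u_j - \beta_j|^2 \|V_j\|_2^2 \ge \frac{\delta\nu[1 + \mu(X)]}{2} \min_{1 \le j \le p} \|V_j\|_2^2 \times \|u - \beta\|_2^2,$$

and so Condition H2 is satisfied with $c_2$ set as in (6.2). □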

6.3 Proofs for LS regression: the case of single analytic disc 

First, we establish Condition H2. 

Proof of Proposition 4.1. For $i = 1, \dots, n$ and $u \in D$, since $X_i^T \beta \in I$ and $X_i^T u \in I$,

$$\|f(Xu) - f(X\beta)\|_2^2 = \sum_{i=1}^{n} |f(X_i^T u) - f(X_i^T \beta)|^2 \ge \sum_{i=1}^{n} d(f, I)^2 |X_i^T u - X_i^T \beta|^2 = d(f, I)^2 \|X(u - \beta)\|_2^2.$$

Since $|\mathrm{spt}(u - \beta)| \le |\mathrm{spt}(u)| + |\mathrm{spt}(\beta)| \le n(\nu)$, then by Proposition 2.2,

$$\|f(Xu) - f(X\beta)\|_2^2 \ge d(f, I)^2\, \nu[1 + \mu(X)] \min_{1 \le j \le p} \|V_j\|_2^2 \times \|u - \beta\|_2^2.$$

Because the right hand side is $c_2 n \|u - \beta\|_2^2$, the proof is complete. □

The main result in this section is Proposition 6.5, which together with Proposition 4.1 immediately leads to Theorem 5.1. For brevity, in the rest of this section, we shall denote $\Pi = \{1, \dots, p\}$.
6.3.1 Power series expansion and tail assumption

To facilitate subsequent discussions, we first consider

$$\varphi(x) = (\varphi_1(x_1), \dots, \varphi_n(x_n))^T, \quad x = (x_1, \dots, x_n)^T \in \mathbb{R}^n,$$

where $\varphi_1, \dots, \varphi_n$ are real-valued functions that may be different from each other. Suppose each $\varphi_i$ can be analytically extended to a neighborhood of 0 in $\mathbb{C}$. Let

$$a_{ik} = \frac{\varphi_i^{(k)}(0)}{k!}. \tag{6.3}$$

Then $\varrho(\varphi_i, 0) = (\limsup_{k} |a_{ik}|^{1/k})^{-1}$. Since we are interested in $\varphi(Xu) - \varphi(Xv)$ instead of $\varphi(Xu)$ itself, without loss of generality, let $\varphi_i(0) = 0$.

For vector $v = (v_1, \dots, v_p)^T$ and $k$-tuple $\alpha = (\alpha_1, \dots, \alpha_k) \in \Pi^k$, denote by $v_\alpha$ the product of $v_{\alpha_1}, \dots, v_{\alpha_k}$. For example, if $p = 3$ and $k = 4$, then $v_{(1,3,1,2)} = v_1 v_3 v_1 v_2 = v_1^2 v_2 v_3$. With this notation, for $i = 1, \dots, n$, $X_{i\alpha} = X_{i\alpha_1} \cdots X_{i\alpha_k}$. For each $j = 1, \dots, p$, let $n_j(\alpha) = |\{i : \alpha_i = j\}|$. Clearly, $n_1(\alpha) + \cdots + n_p(\alpha) = k$.
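The multi-index notation can be made concrete with a few lines of code. This is only an illustrative sketch (the helper names are ours); it reproduces the example $v_{(1,3,1,2)} = v_1^2 v_2 v_3$ and checks that summing $x_\alpha$ over all $\alpha \in \Pi^k$ gives $(x_1 + \cdots + x_p)^k$, the fact behind the expansion of $(X_i^T u)^k$ used next.

```python
from itertools import product
from math import prod

def v_alpha(v, alpha):
    """Product v_{alpha_1} ... v_{alpha_k}; alpha uses 1-based indices as in the text."""
    return prod(v[a - 1] for a in alpha)

def counts(alpha, p):
    """(n_1(alpha), ..., n_p(alpha)): how often each index j appears in alpha."""
    return [sum(1 for a in alpha if a == j) for j in range(1, p + 1)]

v = [2.0, 3.0, 5.0]
alpha = (1, 3, 1, 2)
value = v_alpha(v, alpha)            # v_1 * v_3 * v_1 * v_2 = v_1^2 * v_2 * v_3
n_alpha = counts(alpha, 3)           # components sum to k = 4
total = sum(v_alpha(v, a) for a in product(range(1, 4), repeat=4))
```

Here `total` equals $(2 + 3 + 5)^4$, since every monomial of the expanded power corresponds to exactly one tuple $\alpha$.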

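By (6.3), for $i = 1, \dots, n$, provided $|X_i^T u| < \varrho(\varphi_i, 0)$,

$$\varphi_i(X_i^T u) = \sum_{k=1}^{\infty} a_{ik} (X_i^T u)^k = \sum_{k=1}^{\infty} a_{ik} \Big( \sum_{\alpha \in \Pi^k} X_{i\alpha} u_\alpha \Big).$$

Therefore, if $|X_i^T u| < \varrho(\varphi_i, 0)$ for all $i$, then

$$\langle \epsilon, \varphi(Xu) \rangle = \sum_{i=1}^{n} \epsilon_i \varphi_i(X_i^T u) = \sum_{k=1}^{\infty} \sum_{\alpha \in \Pi^k} u_\alpha \Big( \sum_{i=1}^{n} \epsilon_i a_{ik} X_{i\alpha} \Big). \tag{6.4}$$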

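Lemma 6.3 Suppose $\epsilon$ satisfies (2.3). Let $q_1, q_2, \dots \ge 0$ with $q := \sum_k q_k < 1/2$. Given real numbers $\theta_{ik}$, $1 \le i \le n$, $k \ge 1$, consider the condition

$$\Big| \sum_{i=1}^{n} \epsilon_i \theta_{ik} X_{i\alpha} \Big| \le \sigma \sqrt{2\ln(p^k/q_k)} \sqrt{\sum_{i=1}^{n} \theta_{ik}^2 X_{i\alpha}^2}, \tag{6.5}$$

where $\sigma$ is the constant in (2.3) and $\ln 0$ is defined to be $-\infty$. Then

$$\Pr\big\{(6.5) \text{ holds for all } k \ge 1 \text{ and } \alpha \in \Pi^k\big\} \ge 1 - 2q. \tag{6.6}$$

Proof. The left hand side of (6.6) is at least

$$1 - \sum_{k=1}^{\infty} \sum_{\alpha \in \Pi^k} \Pr\{(6.5) \text{ does not hold for } k \text{ and } \alpha\}.$$

Since $|\Pi^k| = p^k$, it suffices to show that for each $k$ and $\alpha = (\alpha_1, \dots, \alpha_k)$,

$$\Pr\Big\{ \Big| \sum_{i=1}^{n} \epsilon_i \theta_{ik} X_{i\alpha} \Big|^2 > 2\sigma^2 \ln(p^k/q_k) \sum_{i=1}^{n} \theta_{ik}^2 X_{i\alpha}^2 \Big\} \le 2p^{-k} q_k, \tag{6.7}$$

which directly follows from (2.3). □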
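6.3.2 Establishing Condition H1

Recall the following multinomial formula: for any $j = 1, \dots, p$,

$$\sum_{\alpha \in \Pi^k} n_j(\alpha)\, x_j^{n_j(\alpha) - 1} \prod_{s \ne j} x_s^{n_s(\alpha)} = k(x_1 + \cdots + x_p)^{k-1}, \tag{6.8}$$

as the left hand side is equal to

$$\sum_{\alpha \in \Pi^k} \frac{\partial}{\partial x_j} \prod_{s=1}^{p} x_s^{n_s(\alpha)} = \frac{\partial}{\partial x_j} \sum_{k_1 + \cdots + k_p = k} \frac{k!}{k_1! \cdots k_p!}\, x_1^{k_1} \cdots x_p^{k_p} = \frac{\partial}{\partial x_j} (x_1 + \cdots + x_p)^k.$$

For each $j = 1, \dots, p$, let

$$\omega_{jk} = a_{1k}^2 X_{1j}^{2k} + \cdots + a_{nk}^2 X_{nj}^{2k}. \tag{6.9}$$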
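The multinomial formula (6.8), $\sum_{\alpha \in \Pi^k} n_j(\alpha)\, x_j^{n_j(\alpha)-1} \prod_{s \ne j} x_s^{n_s(\alpha)} = k(x_1 + \cdots + x_p)^{k-1}$, can be verified by brute-force enumeration for small $p$ and $k$. The sketch below is illustrative only; the function names are ours.

```python
from itertools import product
from math import prod

def lhs_68(x, k, j):
    """Left side of (6.8) for 1-based coordinate j, summing over all alpha in {1..p}^k."""
    p = len(x)
    total = 0.0
    for alpha in product(range(1, p + 1), repeat=k):
        n = [alpha.count(s) for s in range(1, p + 1)]
        if n[j - 1] == 0:
            continue                      # the factor n_j(alpha) vanishes
        term = n[j - 1] * x[j - 1] ** (n[j - 1] - 1)
        term *= prod(x[s] ** n[s] for s in range(p) if s != j - 1)
        total += term
    return total

def rhs_68(x, k):
    """Right side of (6.8): k (x_1 + ... + x_p)^(k-1)."""
    return k * sum(x) ** (k - 1)
```

The enumeration costs $p^k$ terms, so it is only meant as a check for small cases, not as a way to evaluate the sum in the proofs.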



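Lemma 6.4 Suppose that, with $\theta_{ik} = a_{ik}$, (6.5) holds for all $k \ge 1$ and $\alpha \in \Pi^k$. Given $u$ and $v$, let $d_j = |u_j - v_j|$ and $m_j = |u_j| \vee |v_j|$ for $j = 1, \dots, p$. If

$$\sum_{j=1}^{p} m_j \max_{1 \le i \le n} \frac{|X_{ij}|}{\varrho(\varphi_i, 0)} < 1, \tag{6.10}$$

then, letting $\xi = \langle \epsilon, \varphi(Xu) - \varphi(Xv) \rangle$,

$$|\xi| \le \sigma\sqrt{2} \sum_{k=1}^{\infty} k \sqrt{\ln(p^k/q_k)} \Big[ \sum_{j=1}^{p} m_j \omega_{jk}^{1/(2k)} \Big]^{k-1} \sum_{j=1}^{p} d_j \omega_{jk}^{1/(2k)}. \tag{6.11}$$

Proof. By (6.10), for any $i$,

$$|X_i^T u| \le \sum_{j=1}^{p} |X_{ij}|\, m_j \le \varrho(\varphi_i, 0) \sum_{j=1}^{p} m_j \max_{1 \le l \le n} \frac{|X_{lj}|}{\varrho(\varphi_l, 0)} < \varrho(\varphi_i, 0),$$

and likewise $|X_i^T v| < \varrho(\varphi_i, 0)$. Therefore, by (6.4),

$$\xi = \sum_{k=1}^{\infty} \sum_{\alpha \in \Pi^k} (u_\alpha - v_\alpha) \sum_{i=1}^{n} \epsilon_i a_{ik} X_{i\alpha}.$$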
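By the assumption, (6.5) holds with $\theta_{ik} = a_{ik}$ for all $k \ge 1$ and $\alpha \in \Pi^k$. Thus

$$|\xi| \le \sum_{k=1}^{\infty} \sum_{\alpha \in \Pi^k} |u_\alpha - v_\alpha| \Big| \sum_{i=1}^{n} \epsilon_i a_{ik} X_{i\alpha} \Big| \le \sigma \sum_{k=1}^{\infty} \sqrt{2\ln(p^k/q_k)}\; M_k, \tag{6.12}$$

where $M_k = \sum_{\alpha \in \Pi^k} |u_\alpha - v_\alpha| \big( \sum_{i=1}^{n} a_{ik}^2 X_{i\alpha}^2 \big)^{1/2}$. Given $k \ge 1$, for each $\alpha \in \Pi^k$, by $n_1(\alpha) + \cdots + n_p(\alpha) = k$ and Hölder's inequality (the generalized form of the Cauchy-Schwarz inequality),

$$\sum_{i=1}^{n} a_{ik}^2 X_{i\alpha}^2 = \sum_{i=1}^{n} a_{ik}^2 \prod_{j=1}^{p} X_{ij}^{2n_j(\alpha)} \le \prod_{j=1}^{p} \Big( \sum_{i=1}^{n} a_{ik}^2 X_{ij}^{2k} \Big)^{n_j(\alpha)/k} = \prod_{j=1}^{p} \omega_{jk}^{n_j(\alpha)/k},$$

where the last equality is due to the notation in (6.9). On the other hand,

$$|u_\alpha - v_\alpha| = \Big| \prod_{j=1}^{p} u_j^{n_j(\alpha)} - \prod_{j=1}^{p} v_j^{n_j(\alpha)} \Big| \le \sum_{j=1}^{p} \big| u_j^{n_j(\alpha)} - v_j^{n_j(\alpha)} \big| \prod_{s=1}^{j-1} |v_s|^{n_s(\alpha)} \prod_{s=j+1}^{p} |u_s|^{n_s(\alpha)} \le \sum_{j=1}^{p} n_j(\alpha)\, d_j\, m_j^{n_j(\alpha) - 1} \prod_{s \ne j} m_s^{n_s(\alpha)}.$$

Therefore,

$$\begin{aligned}
M_k &\le \sum_{\alpha \in \Pi^k} \sum_{j=1}^{p} n_j(\alpha)\, d_j\, m_j^{n_j(\alpha) - 1} \prod_{s \ne j} m_s^{n_s(\alpha)} \prod_{s=1}^{p} \omega_{sk}^{n_s(\alpha)/(2k)} \\
&= \sum_{j=1}^{p} d_j \omega_{jk}^{1/(2k)} \sum_{\alpha \in \Pi^k} n_j(\alpha) \big( m_j \omega_{jk}^{1/(2k)} \big)^{n_j(\alpha) - 1} \prod_{s \ne j} \big( m_s \omega_{sk}^{1/(2k)} \big)^{n_s(\alpha)} \\
&= k \Big[ \sum_{s=1}^{p} m_s \omega_{sk}^{1/(2k)} \Big]^{k-1} \sum_{j=1}^{p} d_j \omega_{jk}^{1/(2k)},
\end{aligned}$$

where the last equality is due to the multinomial formula (6.8). Now by (6.12), the inequality in (6.11) is proved. □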



Proposition 6.5 Fix $\theta \in (0, 1)$. Let $D = \{u \in \mathbb{R}^p : \|u\|_{1,\infty} \le \theta\varrho(f, 0)/2\}$ in Condition H1 and $\epsilon$ satisfy (2.3) for $\sigma > 0$. If $\beta \in D$, then Condition H1 is satisfied by setting $c_1$ as in Theorem 5.1.
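Proof. We have $\varphi_i = f$ and $\varrho(\varphi_i, 0) = \varrho(f, 0)$. For $u \in D$, let $d = u - \beta$ and $m = (m_1, \dots, m_p)^T$, with $m_j = |u_j| \vee |\beta_j|$. Then

$$\|m\|_{1,\infty} \le \|u\|_{1,\infty} + \|\beta\|_{1,\infty} \le \theta\varrho(f, 0). \tag{6.13}$$

As a result,

$$\sum_{j=1}^{p} m_j \max_{1 \le i \le n} \frac{|X_{ij}|}{\varrho(f, 0)} = \frac{\|m\|_{1,\infty}}{\varrho(f, 0)} \le \frac{\theta\varrho(f, 0)}{\varrho(f, 0)} < 1$$

and (6.10) is satisfied. Let $q_k = \big(\frac{q}{1+q}\big)^k$. Then $\sum_k q_k = q$, so by Lemmas 6.3 and 6.4, with probability at least $1 - 2q$, (6.11) holds. For each $k \ge 1$, by the notation in (6.9), $\omega_{jk} = (|f^{(k)}(0)|/k!)^2 \|V_j\|_{2k}^{2k}$. Recall that in Theorem 5.1, $\lambda_p$ is defined to be $\ln[p(1 + q^{-1})]$. Since $\sqrt{\ln(p^k/q_k)} = \sqrt{k\lambda_p}$,

$$k \sqrt{\ln(p^k/q_k)} \Big[ \sum_{j=1}^{p} m_j \omega_{jk}^{1/(2k)} \Big]^{k-1} \sum_{j=1}^{p} d_j \omega_{jk}^{1/(2k)} = \sqrt{k\lambda_p}\, \frac{|f^{(k)}(0)|}{(k-1)!} \times \|m\|_{1,2k}^{k-1} \|d\|_{1,2k}, \tag{6.14}$$

where the weighted $L_1$ norm $\|\cdot\|_{1,s}$ is defined in (5.2) and satisfies

$$\|u\|_{1,s} \le \begin{cases} n^{1/s} \|u\|_{1,\infty}, \\ \max_{1 \le j \le p} \|V_j\|_s \times \|u\|_1, \end{cases} \qquad s \ge 1.$$

Then by (6.13),

$$\|m\|_{1,2k}^{k-1} \|d\|_{1,2k} \le \big( n^{\frac{1}{2k}} \|m\|_{1,\infty} \big)^{k-1} \times \max_{1 \le j \le p} \|V_j\|_{2k} \times \|d\|_1 \le \sqrt{n}\, n^{-\frac{1}{2k}} [\theta\varrho(f, 0)]^{k-1} \times \max_{1 \le j \le p} \|V_j\|_{2k} \times \|d\|_1.$$

Together with (6.11) and (6.14), this yields the proof. □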
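Two small computations are used implicitly in this proof. The following worked equations spell them out, using only the choice $q_k = (q/(1+q))^k$ and the definition $\lambda_p = \ln[p(1 + q^{-1})]$ given above.

```latex
% Geometric series: the q_k sum to q.
\sum_{k \ge 1} q_k
  = \sum_{k \ge 1} \Big(\frac{q}{1+q}\Big)^{k}
  = \frac{q/(1+q)}{1 - q/(1+q)}
  = q.

% Logarithm: ln(p^k / q_k) is linear in k.
\ln\frac{p^{k}}{q_k}
  = k \ln p + k \ln\frac{1+q}{q}
  = k \ln\big[p(1 + q^{-1})\big]
  = k \lambda_p.

% Power of n: the exponent (k-1)/(2k) splits as 1/2 - 1/(2k).
\big(n^{\frac{1}{2k}}\big)^{k-1} = n^{\frac12 - \frac{1}{2k}} = \sqrt{n}\; n^{-\frac{1}{2k}}.
```

The factor $\sqrt{n}$ in the last line is the one that matches the normalization $2c_1\sqrt{n}$ seen in the proof of Lemma 6.1, which is why $c_1$ itself carries only $n^{-1/(2k)}$.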



6.4 LS regression: multiple analytic disc case 
6.4.1 Proof of Theorem 5.2 

We first restate Lemma 6.3 as follows.

Lemma 6.6 Let $\epsilon$ satisfy (2.3). Let $E \subset \mathcal{D}(J)$ be finite and for $k \ge 1$ and $u \in E$, let $q_{k,u} \ge 0$, such that $q := \sum_{k \ge 1} \sum_{u \in E} q_{k,u} < 1/2$. Consider the condition

$$\Big| \sum_{i=1}^{n} \epsilon_i a_{ik}(u) X_{i\alpha} \Big| \le \sigma \sqrt{2\ln(p^k/q_{k,u})} \sqrt{\sum_{i=1}^{n} a_{ik}(u)^2 X_{i\alpha}^2}, \tag{6.15}$$

where $\sigma > 0$ is the constant in (2.3). Then

$$\Pr\big\{(6.15) \text{ holds for all } k \ge 1,\ \alpha \in \Pi^k, \text{ and } u \in E\big\} \ge 1 - 2q.$$

The next result provides a bound on $|\langle \epsilon, f(Xu) - f(Xv) \rangle|$ for suitable $u$ and $v$. The method of its proof is described at the end of Section 4.

Lemma 6.7 Given $b(u)$ satisfying (5.4), let $G$ be a finite $b/2$-covering grid of a set $K \subset \mathcal{D}(J)$. Fix $q_k \ge 0$ such that $q := \sum_{k=1}^{\infty} q_k < 1/2$ and $\ln q_k = O(k)$ over $\mathfrak{I} = \{k \in \mathbb{N} : A_k(G) > 0\}$. Suppose that, with $E = G$ and $q_{k,u} = q_k/|G|$, (6.15) holds for all $k \ge 1$, $\alpha \in \Pi^k$, and $u \in G$. If $u, v \in K$ and the entire line segment connecting them is in $K$, then, letting $\xi = \langle \epsilon, f(Xu) - f(Xv) \rangle$ and $d = v - u$,

$$|\xi| \le \sigma\sqrt{2n}\, H(b(G), d), \tag{6.16}$$

where $H(b(G), d) < \infty$, with

$$H(z, d) = \sum_{k=1}^{\infty} k \sqrt{\ln|G| + \ln(p^k/q_k)}\; A_k(G) \times n^{-\frac{1}{2k}} \|d\|_{1,2k} \times z^{k-1}.$$

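Proof. Since $G$ is finite, $b(G) < r(G)$. Given $\eta \in (0, r(G)/b(G) - 1)$, let

$$T = \left\lceil \frac{2\|d\|_{1,\infty}}{\eta\, b(G)} \right\rceil.$$

By the assumption, $u + \theta d \in K$ for $\theta \in [0, 1]$. For $t = 0, \dots, T$, let $u^{(t)} = u + td/T$. Then $u^{(0)} = u$, $u^{(T)} = v$, and $u^{(t)} \in K$. Fix $t = 1, \dots, T$. Then

$$\|u^{(t)} - u^{(t-1)}\|_{1,\infty} = \|d\|_{1,\infty}/T \le \eta\, b(G)/2.$$

By the definition of $G$, we can find some $w \in G$, such that $\|u^{(t)} - w\|_{1,\infty} \le b(G)/2$. Then $\|u^{(t-1)} - w\|_{1,\infty} \le (1 + \eta) b(G)/2$. Let $\varphi(x) = (\varphi_1(x_1), \dots, \varphi_n(x_n))^T$, with

$$\varphi_i(z) = f(z + X_i^T w) - f(X_i^T w), \quad 1 \le i \le n.$$

Let $\tilde{u} = u^{(t)} - w$, $\tilde{v} = u^{(t-1)} - w$. Then

$$\varphi(X\tilde{u}) = f(Xu^{(t)}) - f(Xw), \quad \varphi(X\tilde{v}) = f(Xu^{(t-1)}) - f(Xw),$$

and, as shown just now,

$$\|\tilde{u}\|_{1,\infty} \le (1 + \eta) b(G)/2, \quad \|\tilde{v}\|_{1,\infty} \le (1 + \eta) b(G)/2.$$

Let $m = (m_1, \dots, m_p)^T$ with $m_j = |\tilde{u}_j| \vee |\tilde{v}_j|$. From the above inequalities we get

$$\|m\|_{1,\infty} \le \|\tilde{u}\|_{1,\infty} + \|\tilde{v}\|_{1,\infty} \le (1 + \eta) b(G), \tag{6.17}$$

and hence, by $\varrho(\varphi_i, 0) = \varrho(f, X_i^T w) \ge r(w)$,

$$\sum_{j=1}^{p} m_j \max_{1 \le i \le n} \frac{|X_{ij}|}{\varrho(\varphi_i, 0)} \le \frac{\|m\|_{1,\infty}}{r(w)} \le \frac{(1 + \eta) b(G)}{r(G)} < 1.$$

Now Lemma 6.4 can be applied to $\varphi$, with $u$, $v$, and $q_k$ therein replaced with $\tilde{u}$, $\tilde{v}$, and $q_k/|G|$, respectively. Then

$$|\langle \epsilon, f(Xu^{(t)}) - f(Xu^{(t-1)}) \rangle| \le \sigma\sqrt{2} \sum_{k=1}^{\infty} k \sqrt{\ln|G| + \ln(p^k/q_k)}\; M_k^{(t)},$$

where

$$M_k^{(t)} = \Big[ \sum_{j=1}^{p} m_j \omega_{jk}^{1/(2k)} \Big]^{k-1} \sum_{j=1}^{p} |\tilde{u}_j - \tilde{v}_j|\, \omega_{jk}^{1/(2k)},$$

with $\omega_{jk} = \sum_{i=1}^{n} a_{ik}(w)^2 |X_{ij}|^{2k} \le A_k(G)^2 \|V_j\|_{2k}^{2k} \le n A_k(G)^2 \|V_j\|_\infty^{2k}$.

Since $u^{(t)} - u^{(t-1)} = d/T$, it follows that

$$M_k^{(t)} \le \Big[ \sum_{j=1}^{p} m_j\, n^{\frac{1}{2k}} A_k(G)^{1/k} \|V_j\|_\infty \Big]^{k-1} \times \sum_{j=1}^{p} \frac{|d_j|}{T} A_k(G)^{1/k} \|V_j\|_{2k} \le \sqrt{n}\, A_k(G) [(1 + \eta) b(G)]^{k-1} \times n^{-\frac{1}{2k}} \frac{\|d\|_{1,2k}}{T},$$

where the last inequality is due to (6.17). Consequently,

$$|\xi| \le \sum_{t=1}^{T} |\langle \epsilon, f(Xu^{(t)}) - f(Xu^{(t-1)}) \rangle| \le \sigma\sqrt{2n}\, H((1 + \eta) b(G), d).$$

By (5.6) and $\ln q_k = O(k)$ over $\mathfrak{I}$, the radius of convergence of the power series defining $g(z) = H(z, d)$ is $r(G) > b(G)$. As $(1 + \eta) b(G) < r(G)$, we can let $\eta \to 0$ and apply dominated convergence. The proof is then complete. □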



Proposition 6.8 In Condition H1, let $D$ be a compact subset of $\mathcal{D}(J)$. Suppose $\epsilon$ satisfies (2.3) for some $\sigma > 0$. Let $G$ be a finite $b/2$-covering grid of $C(D)$. If $\beta \in D$, then Condition H1 is satisfied by setting $c_1$ as in Theorem 5.2.

Proof. Since $C(D)$ is compact, it indeed has a finite $b/2$-covering grid, justifying the assumption on $G$. As in the proof of Proposition 6.5, let $q_k = \big(\frac{q}{1+q}\big)^k$. Then by Lemmas 6.6 and 6.7, with probability at least $1 - 2q$, (6.16) holds. The rest of the proof follows that for Proposition 6.5 and hence is omitted for brevity. □

Proof of Theorem 5.2. First, by $D \subset \mathcal{D}(I, n(\nu)/2)$ and $d(f, I) > 0$, Proposition 4.1 can be applied to yield $c_2$. Second, $C(D)$ is compact and since $I$ is an interval, $C(D) \subset \mathcal{D}(I)$. Then $C(D) \subset \mathcal{D}(J)$. Proposition 6.8 can be applied to $K = C(D)$ to get $c_1$. □



6.4.2 Other technical results 

Proof of Proposition 5.4. Because $D \subset \mathcal{D}(I, h/2)$ and is compact, $K = C(D) \subset \mathcal{D}(I, h)$ and is compact.

First, fix $S$ with $|S| = h$ and $K_S \ne \emptyset$. Let $\psi_S : \mathbb{R}^p \to \mathbb{R}^S$ be the natural projection and $\iota_S : \mathbb{R}^S \to \mathbb{R}^p$ the immersion, such that $\iota_S(y) = z \in \mathbb{R}^p$, with $z_j = y_j$ for $j \in S$ and $z_j = 0$ for $j \notin S$. Define the weighted $L_1$ norm $\|\cdot\|_S$ on $\mathbb{R}^S$ such that $\|u\|_S = \sum_{j \in S} |u_j| \|V_j\|_\infty$. For ease of notation, denote $B_S(w, a) = B(w, a; \|\cdot\|_S)$ and $\delta_S(E) = \delta(E; \|\cdot\|_S)$. Likewise, denote $B(w, a) = B(w, a; \|\cdot\|_{1,\infty})$ and $\delta(E) = \delta(E; \|\cdot\|_{1,\infty})$.

Fix $d > 0$. Later we will set $d$ to specific values. Let $E = \psi_S(K_S)$. It is easy to verify that $\delta_S(E) = \delta(K_S)$. By a simple geometric argument, it is seen that $E$ can be covered by no more than $[\delta(K_S)/d + 1]^h$ spheres $B_S(\tilde{u}_k, d)$, with each one intersecting with $E$. Let $u_k = \iota_S(\tilde{u}_k)$.

In case (1), let $d = d_b/2$. By $J = \mathbb{R}$, $f$ is analytic at every $X_i^T u_k$. Then, by

$$K_S = \iota_S(E) \subset \bigcup_k \iota_S(B_S(\tilde{u}_k, d)) \subset \bigcup_k B(u_k, d) \subset \bigcup_k B(u_k, b(u_k)/2),$$

$u_1, \dots, u_m$ is a $b/2$-covering grid of $K_S$.

In case (2), let $d = d_b/4$. Since $f$ may not be analytic at every $X_i^T u_k$, we cannot directly take $u_1, \dots, u_m$ as a covering grid. For each $k = 1, \dots, m$, choose an arbitrary $\tilde{w}_k \in B_S(\tilde{u}_k, d) \cap E$ and let $w_k = \iota_S(\tilde{w}_k)$. As $w_k \in K_S$, $f$ is analytic at every $X_i^T w_k$. It is easy to check that $B_S(\tilde{w}_k, 2d)$ contains $B_S(\tilde{u}_k, d)$. Therefore,

$$K_S = \iota_S(E) \subset \bigcup_k \iota_S(B_S(\tilde{w}_k, 2d)) \subset \bigcup_k B(w_k, 2d) \subset \bigcup_k B(w_k, b(w_k)/2),$$

so $w_1, \dots, w_m$ is a $b/2$-covering grid of $K_S$.

Denote by $G_S$ the covering grid as above in either case. As $K = \bigcup_{|S| = h} K_S$, $G = \bigcup_{|S| = h:\, K_S \ne \emptyset} G_S$ is a $b/2$-covering grid of $K$ and

$$|G| \le \sum_{|S| = h:\, K_S \ne \emptyset} |G_S|.$$

We already know $|G_S| \le [\delta(K_S)/d + 1]^h$. By $\delta(K_S) \le \delta(K) = \delta(D)$,

$$|G_S| \le [\delta(D)/d + 1]^h.$$

Finally, there are at most $\binom{p}{h}$ subsets $S$ with $|S| = h$ and $K_S \ne \emptyset$. The proof for the bounds on $|G|$ is thus complete. □
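To get a feel for the size of the covering bound $\binom{p}{h}[\delta(D)/d + 1]^h$ with $d = d_b/2$ in case (1) and $d = d_b/4$ in case (2), the sketch below evaluates it together with the cruder logarithmic bound $h\ln(pQ)$ that appears under the square root in (5.9) and (5.10). The function name and arguments are hypothetical; `diam` stands for $\delta(D)$.

```python
import math

def grid_size_bound(p, h, diam, d_b, case=1):
    """Return (binom(p, h) * Q**h, h * ln(p * Q)) with Q = 2*diam/d_b + 1 or 4*diam/d_b + 1."""
    factor = 2.0 if case == 1 else 4.0      # d = d_b/2 in case (1), d = d_b/4 in case (2)
    q_factor = factor * diam / d_b + 1.0    # the quantity called Q in Proposition 5.5
    bound = math.comb(p, h) * q_factor ** h
    log_upper = h * math.log(p * q_factor)  # uses binom(p, h) <= p**h
    return bound, log_upper

bound1, log1 = grid_size_bound(p=50, h=3, diam=2.0, d_b=0.5, case=1)
bound2, log2 = grid_size_bound(p=50, h=3, diam=2.0, d_b=0.5, case=2)
```

Since $\binom{p}{h} \le p^h$, the logarithm of the first returned value never exceeds the second, which is the step that turns $\ln|G|$ into $h\ln(pQ)$.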
Proof of Proposition 5.5. (1) If $c > \varrho_0$, then there is $t \in J$ such that $\varrho(f, t) < c$. Since $\limsup_k [|f^{(k)}(t)|/k!]^{1/k} = \varrho(f, t)^{-1}$,

$$\limsup_{k \to \infty} d_k c^k \ge \limsup_{k \to \infty} \frac{|f^{(k)}(t)|\, c^k}{k!} = \infty.$$

Therefore, the radius of convergence of $\sum_{k \ge 1} d_k z^k$ is at most $\varrho_0$. To show that the radius of convergence is $\varrho_0$, it suffices to show that $d_k c^k$ is bounded for any $c \in (0, \varrho_0)$. By assumption $M := \sup_{|\mathrm{Im}(z)| \le c} |f'(z)| < \infty$. Fix $x \in \mathbb{R}$. For any $z$ with $|z - x| = c$, $|\mathrm{Im}(z)| \le c$. Therefore, by Cauchy's contour integral,

$$\frac{|f^{(k)}(x)|}{k!} = \left| \frac{1}{2k\pi\sqrt{-1}} \oint_{|z - x| = c} \frac{f'(z)\,dz}{(z - x)^k} \right| \le \frac{1}{2k\pi} \oint_{|z - x| = c} \frac{|f'(z)|\,|dz|}{|z - x|^k} \le \frac{M}{k c^{k-1}}.$$

Take the supremum over $x$. Then we get $d_k c^k \le Mc < \infty$ for all $k \ge 1$.

From the definitions in (5.3), it is clear that $A_k(u) \le d_k$ and $r(u) \ge \varrho_0$ for $u \in D$. Given any $\varrho_1 \in (0, \varrho_0)$, let $b(u) \equiv \varrho_1$. By Proposition 5.4 (1), there is a $b/2$-covering grid $G$ for $C(D)$ with $|G| \le p^h (2\delta(D)/\varrho_1 + 1)^h$. Therefore, $c_1$ can be set as in (5.9).

(2) For each $x \in I$, $\varrho(f, x) > 0$. Since $I$ is compact, it is covered by a finite number of intervals $(x_i - \varrho(f, x_i)/2,\ x_i + \varrho(f, x_i)/2)$. Let $c = \min_i \varrho(f, x_i)/2$. Then $c > 0$. For any $x \in I$, there is $x_i$ such that $|x - x_i| < \varrho(f, x_i)/2$. Then for any $z \in \mathbb{C}$ with $|z - x| < c$, $|z - x_i| < \varrho(f, x_i)$ and hence $f$ is analytic at $z$. As a result, $f$ is analytic in the disc centered at $x$ with radius $c$, and so $\varrho(f, x) \ge c$. This leads to $\varrho_0 = \inf_{x \in I} \varrho(f, x) \ge c$. For $c \in (0, \varrho_0)$, since $I_c = \{z \in \mathbb{C} : |z - x| \le c \text{ for some } x \in I\}$ is compact, $M := \sup_{z \in I_c} |f'(z)| < \infty$. Using Cauchy's contour integral as in (1), it can be shown that $\varrho_0$ is the radius of convergence of $\sum_{k \ge 1} d_k z^k$. The rest of (2) can be proved following the argument for (1). □